APACHECON EUROPE
Oct. 22nd - 24th, 2019
Fast federated SQL with
Apache Calcite
Chris Baynes
@binaryexp
chris@contiamo.com
● CTO at Contiamo
● Apache Calcite committer
Content
● Intro to Apache Calcite
● Calcite toolbox
● Performance - single data sources
● Federated queries
Intro to Apache Calcite
Secret sauce behind lots of existing tech:
Intro to Apache Calcite
Important piece in the “deconstructed database” *
* https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
Intro to Apache Calcite
Pluggable framework:
● Standard SQL query parsing (lots of different dialects via Babel)
● Query rewriting
● Materializations (even auto-generation with lattices)
● Query optimisation (rule based and cost based optimizers)
● Support data source connections either as JDBC drivers or custom adapters
● In-memory execution model
● SQL Streams
● ...
Calcite toolbox
Calcite terminology & architecture
SqlNode: SQL tree constructed by parser
RelNode: The query plan (relational algebra AST)
Program: Planner/Optimizer
● Cost based (Volcano)
● Rule based (Hep)
Babel parser (SQL parser)
Parse almost any SQL
● Not dialect-specific parsing
● Avoids making assumptions about keywords
● Handles special syntax (e.g. cast using :: in Postgres)
Useful when:
● Impersonating a database server
● Clients are more familiar with a dialect
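A minimal parse with the Babel parser might look like this. This is a sketch: it assumes the calcite-babel module is on the classpath, and `SqlParser.config()` is the builder style of recent Calcite releases (older versions use `SqlParser.configBuilder()`); the table and column names are made up.

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.parser.babel.SqlBabelParserImpl;
import org.apache.calcite.sql.validate.SqlConformanceEnum;

public class BabelParseExample {
  public static void main(String[] args) throws Exception {
    // Postgres-style :: cast, which the default Calcite parser rejects
    String sql = "select amount::integer from payments";

    SqlParser.Config config = SqlParser.config()
        .withParserFactory(SqlBabelParserImpl.FACTORY)
        .withConformance(SqlConformanceEnum.BABEL);

    SqlNode node = SqlParser.create(sql, config).parseQuery();
    System.out.println(node);
  }
}
```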
SqlShuttle / RelShuttle (Rewriters)
Uses the visitor pattern to transform one SqlNode/RelNode tree into another
Useful for:
● View rewrites
● Row level security
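A sketch of a simple view rewrite with SqlShuttle: every reference to one table name is swapped for another. The names `secure_view` and `customers` are hypothetical; note the default parser uppercases unquoted identifiers, hence the case-insensitive compare.

```java
import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.util.SqlShuttle;

public class ShuttleRewriteExample {
  public static SqlNode rewrite(SqlNode query) {
    // Replace references to "secure_view" with its underlying table
    return query.accept(new SqlShuttle() {
      @Override public SqlNode visit(SqlIdentifier id) {
        if (id.names.size() == 1 && id.names.get(0).equalsIgnoreCase("secure_view")) {
          return new SqlIdentifier("customers", id.getParserPosition());
        }
        return id;
      }
    });
  }

  public static void main(String[] args) throws Exception {
    SqlNode query = SqlParser.create("select name from secure_view").parseQuery();
    System.out.println(rewrite(query));
  }
}
```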
RelBuilder (Plan builder)
● Programmatically build a RelNode
● Skip SQL parsing
Useful when:
● Query is generated by model / UI
● Avoid writing SQL by hand
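Building a plan equivalent to `select name, id from customers order by name limit 10` programmatically might look like the sketch below. The tiny reflective schema exists only to make the example self-contained; all table and field names are made up.

```java
import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

public class RelBuilderExample {
  // Tiny in-memory schema so the sketch is self-contained (names are made up)
  public static class Customer {
    public final int id;
    public final String name;
    public Customer(int id, String name) { this.id = id; this.name = name; }
  }
  public static class Db {
    public final Customer[] customers = {
        new Customer(1, "Ada"), new Customer(2, "Bob")
    };
  }

  /** Builds a plan equivalent to: select name, id from customers order by name limit 10. */
  public static RelNode buildPlan() {
    SchemaPlus root = Frameworks.createRootSchema(true);
    SchemaPlus db = root.add("db", new ReflectiveSchema(new Db()));
    FrameworkConfig config = Frameworks.newConfigBuilder().defaultSchema(db).build();
    RelBuilder b = RelBuilder.create(config);
    return b.scan("customers")
        .project(b.field("name"), b.field("id"))
        .sortLimit(0, 10, b.field("name"))  // order by name limit 10
        .build();
  }

  public static void main(String[] args) {
    System.out.println(RelOptUtil.toString(buildPlan()));
  }
}
```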
Statistics
Provides metadata about physical tables:
● Number of rows (default: 100)
● Column uniqueness
● Distribution of values
Useful when:
● Improving planner costs
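Statistics are supplied through a table's `getStatistic()` method. A sketch, assuming a hypothetical customers table with a made-up row count; column 0 is declared a unique key:

```java
import com.google.common.collect.ImmutableList;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.util.ImmutableBitSet;

public class CustomersTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("id", SqlTypeName.BIGINT)
        .add("name", SqlTypeName.VARCHAR)
        .build();
  }

  @Override public Statistic getStatistic() {
    // ~1M rows instead of the default 100; column 0 (id) is a unique key
    return Statistics.of(1_000_000d, ImmutableList.of(ImmutableBitSet.of(0)));
  }
}
```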
Materializations
● Define materialization with SQL
● Automatically rewrite queries to use materializations
1. Materialize pageviews by country & city (pviews_by_loc)
2. Make query:

select country, city, count(*)
from pageviews
group by country, city
order by country

Calcite automatically rewrites it against the materialization:

select country, city, cnt
from pviews_by_loc
order by country
SqlDialect
● Convert to a dialect-specific SQL string
● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …)
● Easy to plug in your own dialect
public class CustomDialect extends SqlDialect {

  public CustomDialect(Context context) {
    super(context);
  }

  @Override public boolean supportsNestedAggregations() { return false; }
  @Override public boolean supportsWindowFunctions() { return false; }

  @Override public SqlNode getCastSpec(RelDataType type) {
    // translate type names ...
  }

  @Override public void unparseCall(SqlWriter writer, SqlCall call,
      int leftPrec, int rightPrec) {
    // make changes to the SqlCall (SqlNode) ...
  }
}
Lattices
● Define dimensions on star schemas
● Combinations of dimensions form summary tables
● Materialized summary tables are called tiles
● Trick is to figure out which tiles to pick
https://www.slideshare.net/julianhyde/dont-optimize-my-queries-optimize-my-data-81360892
Conclusion - tools
● Many different tools
● Can be used to build complex applications
● Consider effect on performance
Performance - Single data source
Measure - benchmark
Use a standard SQL benchmark tool: TPC

TPC-H:
● 22 queries
● configurable data size: 1 GB, 10 GB, 30 GB, ...
● queries generated for lots of dialects

Measure overhead: exact configuration & hardware are not so important for us
Measure - trace
Query push down
select name, id from customers order by name limit 10

✔ Full query push down:

JdbcToEnumerableConverter
  JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])

✘ Sort and limit in memory:

EnumerableSort(sort0=[$1], dir0=[ASC], fetch=[10])
  JdbcToEnumerableConverter
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])
Benchmark runs
4 benchmark runs:
0. Direct against database
1. Query engine - unoptimized (defaults)
2. Query engine - with statistics
3. Query engine - with multi-phase optimiser
Benchmark baseline (run 0)
Benchmark run 1 - unoptimised
total = 158.1s
Benchmark run 1 - unoptimised
Results
● Slow!
● Very little push-down
● Planning consumes a lot of time
Try
● Using table statistics
Benchmark run 2 - use table stats
Benchmark run 2 - use table stats
Results
● Some queries noticeably faster
● Most work still being done in memory
● Query planning is very slow
Next, try: Planning phases
Benchmark run 3 - planning phases
Split planning into 2 phases (create separate Calcite Programs):
1. Logical phase:
Use rule-based optimiser (Hep), apply all logical optimisation rules
2. Physical phase:
Use cost-based optimiser (Volcano), apply all physical rules and conversions
(jdbc, druid, …)
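Using Calcite's Programs utilities, the two phases might be wired up as below. This is a sketch: the logical rule selection is illustrative only, the physical rule list is left as a placeholder, and exact helper signatures vary a little between Calcite versions.

```java
import java.util.List;
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.rel.metadata.DefaultRelMetadataProvider;
import org.apache.calcite.rel.rules.CoreRules;
import org.apache.calcite.tools.Program;
import org.apache.calcite.tools.Programs;

public class TwoPhasePlanner {
  public static Program build() {
    // Phase 1: rule-based (Hep) pass applying the logical rewrite rules
    List<RelOptRule> logicalRules = List.of(
        CoreRules.FILTER_INTO_JOIN,
        CoreRules.PROJECT_MERGE,
        CoreRules.FILTER_PROJECT_TRANSPOSE);
    Program logical =
        Programs.hep(logicalRules, /*noDag=*/ true, DefaultRelMetadataProvider.INSTANCE);

    // Phase 2: cost-based (Volcano) pass with physical rules and
    // conversions only (jdbc, druid, ... converter rules go here)
    Program physical = Programs.ofRules(/* physical converter rules ... */);

    return Programs.sequence(logical, physical);
  }
}
```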
Benchmark run 3 - planning phases vs stats
Benchmark run 3 - planning phases vs baseline
Benchmark run 3 - planning phases
Results
● Planning and execution are much faster
● Full query push down
● Almost all query time being spent in data source
● Some queries are faster than baseline
Conclusion - single data source
The goal: full query push down
Fastest way to get there:
● Apply most rules in rule-based optimiser
● Avoid giving cost-based optimiser too many rules
● Rely on underlying data source optimiser
Federated queries
“Single query based on information from multiple data sources”
Obstacles to efficiency
● Joins of large tables
● Large driver fetchSize
● Aggregations
● Other in-memory computations
Can lead to:
● Slow queries
● OOM
Solution: use Spark as the execution engine
● Scales out to many nodes
● Avoids OOM for large datasets
● Some Calcite support

Bonus:
● Eases access to data lakes
Calcite support for Spark
Currently some options available:
● Calcite Spark adapter
○ Translates a Calcite plan => Spark plan
○ Sends the Spark plan to the cluster
○ - Needs more work to make the translation complete
● Calcite SparkSqlDialect
○ Use RelToSqlConverter to generate a query in Spark SQL
○ - Spark completely re-plans/optimises the query
○ - Spark push-downs are not as complete
● Send SQL queries directly to Spark
○ - Can’t use the Calcite toolbox
○ - Spark push-downs are not as complete
A hybrid approach
1. Generate a Calcite plan
2. Identify pieces related to separate data sources
3. Squash those pieces into “Spark tables”
4. Construct a dataframe for every “Spark table”
5. Convert the remaining plan to Spark SQL
Hybrid approach - example
EnumerableLimit(fetch=10)
EnumerableHashJoin
JdbcToEnumerableConverter(datasource=[X])
JdbcProject(cust_id, sale_id)
JdbcTableScan(table=[sales])
JdbcToEnumerableConverter(datasource=[Y])
JdbcProject(cust_id)
JdbcFilter(city=[Berlin])
JdbcTableScan(table=[customer])
Join two tables from different data sources, X & Y:

select cust_id, sale_id
from X.sales join Y.customer
using (cust_id)
where Y.customer.city = ‘Berlin’
limit 10
Hybrid approach - example
EnumerableLimit(fetch=10)
EnumerableHashJoin
JdbcToEnumerableConverter(datasource=[X])
JdbcProject(cust_id, sale_id)
JdbcTableScan(table=[sales])
JdbcToEnumerableConverter(datasource=[Y])
JdbcProject(cust_id)
JdbcFilter(city=[Berlin])
JdbcTableScan(table=[customer])
Identify data source pieces
t1: select cust_id, sale_id
from X.sales
Hybrid approach - example
EnumerableLimit(fetch=10)
EnumerableHashJoin
JdbcToEnumerableConverter(datasource=[X])
JdbcProject(cust_id, sale_id)
JdbcTableScan(table=[sales])
JdbcToEnumerableConverter(datasource=[Y])
JdbcProject(cust_id)
JdbcFilter(city=[Berlin])
JdbcTableScan(table=[customer])
Identify data source pieces
t2: select cust_id
from Y.customer
where city = ‘Berlin’
Hybrid approach - example
SparkLimit(fetch=10)
SparkHashJoin
SparkTable(table=t1, ds=[X], query=[...])
SparkTable(table=t2, ds=[Y], query=[...])
select *
from t1 join t2
using (cust_id)
limit 10
Use the Spark SqlDialect to transform into a Spark query
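That conversion step can be sketched with RelToSqlConverter. Method names follow recent Calcite releases (older versions use visitChild instead of visitRoot); the plan argument would be the remaining join-over-Spark-tables plan.

```java
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.SparkSqlDialect;

public class SparkSqlEmitter {
  /** Renders a Calcite plan as a Spark SQL string. */
  public static String toSparkSql(RelNode plan) {
    RelToSqlConverter converter = new RelToSqlConverter(SparkSqlDialect.DEFAULT);
    SqlNode node = converter.visitRoot(plan).asStatement();
    return node.toSqlString(SparkSqlDialect.DEFAULT).getSql();
  }
}
```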
Conclusion - Hybrid approach
● Full scalability of Spark
● Push-down ability of Calcite
● No bottleneck through an in-memory layer
● Can be persisted to an external db / file using Spark
● Approach would also work for other engines (Drill, Flink, Presto)
Thanks!
@binaryexp
chris@contiamo.com
calcite.apache.org

More Related Content

What's hot

Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0Databricks
 
Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)Ryan Cuprak
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Databricks
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfEric Xiao
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookDatabricks
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesShivji Kumar Jha
 

What's hot (20)

Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0
 
Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)Exploring Java Heap Dumps (Oracle Code One 2018)
Exploring Java Heap Dumps (Oracle Code One 2018)
 
Presto
PrestoPresto
Presto
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
 

Similar to Fast federated SQL queries with Apache Calcite and Spark

A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache SparkKnoldus Inc.
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsMonal Daxini
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
Meetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDTMeetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDTSolidify
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark applicationdatamantra
 

Similar to Fast federated SQL queries with Apache Calcite and Spark (20)

A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache Spark
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data models
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Meetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDTMeetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDT
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 

Recently uploaded

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Recently uploaded (20)

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Fast federated SQL queries with Apache Calcite and Spark

  • 1. APACHECON EUROPE, Oct. 22nd - 24th 2019. Fast federated SQL with Apache Calcite. Chris Baynes
  • 2. @binaryexp chris@contiamo.com ● CTO at Contiamo ● Apache Calcite committer
  • 3.
  • 4. Content ● Intro to Apache Calcite ● Calcite toolbox ● Performance - single data sources ● Federated queries
  • 5. Intro to Apache Calcite Secret sauce behind lots of existing tech:
  • 6. Intro to Apache Calcite Important piece in the “deconstructed database” * * https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
  • 7. Intro to Apache Calcite Pluggable framework: ● Standard SQL query parsing (lots of different dialects via Babel) ● Query rewriting ● Materializations (even auto-generation with lattices) ● Query optimisation (rule-based and cost-based optimizers) ● Data source connections, either as JDBC drivers or custom adapters ● In-memory execution model ● SQL streams ● ...
  • 9. Calcite terminology & architecture SqlNode: SQL tree constructed by parser RelNode: The query plan (relational algebra AST) Program: Planner/Optimizer ● Cost based (Volcano) ● Rule based (Hep)
  • 10. Babel parser (SQL parser) Parse almost any SQL ● Not dialect specific parsing ● Avoids making assumptions about keywords ● Handles special syntax (e.g. cast using :: in Postgres) Useful when: ● Impersonating a database server ● Clients are more familiar with a dialect
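As a rough sketch of how the Babel parser is plugged in (assuming a recent Calcite version where `SqlParser.config()` exists; older releases use `SqlParser.configBuilder()`):

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.parser.babel.SqlBabelParserImpl;

public class BabelExample {
  // Parse a query with Babel's lenient, multi-dialect grammar
  // instead of Calcite's standard parser.
  public static SqlNode parse(String sql) throws Exception {
    SqlParser.Config config = SqlParser.config()
        .withParserFactory(SqlBabelParserImpl.FACTORY);
    return SqlParser.create(sql, config).parseQuery();
  }

  public static void main(String[] args) throws Exception {
    // Postgres-style "::" cast that the core parser would reject:
    System.out.println(parse("select '42'::integer"));
  }
}
```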
  • 11. SqlShuttle / RelShuttle (Rewriters) Uses visitor pattern to transform one SqlNode/RelNode tree into another Useful for: ● View rewrites ● Row level security
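A minimal sketch of such a rewriter; the table names are hypothetical, and a real row-level-security rewrite would track query context so that column identifiers with the same name are left alone:

```java
import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.util.SqlShuttle;

// Hypothetical rewrite: redirect every simple reference to the
// "orders" table to a secured view "orders_secure".
public class TableRewriter extends SqlShuttle {
  @Override public SqlNode visit(SqlIdentifier id) {
    if (id.isSimple() && "orders".equalsIgnoreCase(id.getSimple())) {
      return new SqlIdentifier("orders_secure", id.getParserPosition());
    }
    return id;
  }
}
// Usage: SqlNode rewritten = parsedQuery.accept(new TableRewriter());
```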
  • 12. RelBuilder (Plan builder) ● Programmatically build a RelNode ● Skip SQL parsing Useful when: ● Query is generated by model / UI ● Avoid writing SQL by hand
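For illustration, a sketch that builds the plan for "select name, id from customers order by name limit 10" programmatically; the schema, table, and column names are assumptions:

```java
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

public class PlanBuilderExample {
  // Build a RelNode tree directly, skipping SQL parsing entirely.
  // Assumes `schema` contains a "customers" table with "name"/"id".
  public static RelNode buildPlan(SchemaPlus schema) {
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(schema)
        .build();
    RelBuilder builder = RelBuilder.create(config);
    return builder
        .scan("customers")
        .project(builder.field("name"), builder.field("id"))
        .sortLimit(0, 10, builder.field("name")) // offset 0, fetch 10
        .build();
  }
}
```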
  • 13. Statistics Provide metadata about physical tables. ● Number of rows (100 default) ● Column uniqueness ● Distribution of values Useful when: ● Improving planner costs
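A sketch of supplying such metadata through Calcite's `Statistic` interface; the row count and key column here are hypothetical:

```java
import com.google.common.collect.ImmutableList;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.util.ImmutableBitSet;

public class StatsExample {
  // A custom Table implementation would return this from getStatistic(),
  // replacing the 100-row default the planner otherwise assumes.
  // Hypothetical numbers: 1M rows, column 0 is a unique key.
  public static Statistic customerStats() {
    return Statistics.of(
        1_000_000d,
        ImmutableList.of(ImmutableBitSet.of(0)));
  }
}
```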
  • 14. Materializations ● Define materialization with SQL ● Automatically rewrite queries to use materializations 1. Materialize pageviews by country & city (pviews_by_loc) 2. Make query: select country, city, count(*) from pageviews group by country, city order by country => automatically rewritten to: select country, city, cnt from pviews_by_loc order by country
  • 15. SqlDialect ● Convert to dialect-specific SQL string ● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …) ● Easy to plug in your own dialect
public class CustomDialect extends SqlDialect {
  @Override public boolean supportsNestedAggregations() { return false; }
  @Override public boolean supportsWindowFunctions() { return false; }
  @Override public SqlNode getCastSpec(RelDataType type) {
    // translate type names ...
  }
  @Override public void unparseCall(SqlWriter writer, SqlCall call, int leftPrec, int rightPrec) {
    // make changes to the SqlCall (SqlNode) ...
  }
}
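A sketch of how a dialect is typically used to render a plan back to SQL text via `RelToSqlConverter` (assuming a Calcite version where `visitRoot` is available):

```java
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlDialect;
import org.apache.calcite.sql.SqlNode;

public class DialectExample {
  // Render a relational plan as dialect-specific SQL text - this is
  // also the mechanism behind pushing queries down to JDBC sources.
  public static String toSql(RelNode plan, SqlDialect dialect) {
    RelToSqlConverter converter = new RelToSqlConverter(dialect);
    SqlNode node = converter.visitRoot(plan).asStatement();
    return node.toSqlString(dialect).getSql();
  }
  // e.g. toSql(plan, PostgresqlSqlDialect.DEFAULT)
  //   or toSql(plan, new CustomDialect(...))
}
```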
  • 16. Lattices ● Define dimensions on star schemas ● Combinations of dimensions form summary tables ● Materialized summary tables are called tiles ● Trick is to figure out which tiles to pick
  • 18. Conclusion - tools ● Many different tools ● Can be used to build complex applications ● Consider the effect on performance
  • 19. Performance - Single data source
  • 20. Measure - benchmark Use a standard SQL benchmark tool: TPC TPC-H: ● 22 queries ● data size configurable: 1 GB, 10, 30, ... ● generated for lots of dialects We measure overhead, so configuration & hardware are not so important for us
  • 22. Query push down select name, id from customers order by name limit 10 ✔ Full query push down: JdbcToEnumerableConverter JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10]) JdbcProject(name, id) JdbcTableScan(table=[customers]) ✘ Sort and limit in memory: EnumerableSort(sort0=[$1], dir0=[ASC], fetch=10) JdbcToEnumerableConverter JdbcProject(name, id) JdbcTableScan(table=[customers])
  • 23. Benchmark runs 4 benchmark runs: 0. Direct against database 1. Query engine - unoptimized (defaults) 2. Query engine - with statistics 3. Query engine - with multi-phase optimiser
  • 25. Benchmark run 1 - unoptimised total = 158.1s
  • 26. Benchmark run 1 - unoptimised Results ● Slow! ● Very little push-down ● Planning consumes a lot of time Try ● Using table statistics
  • 27. Benchmark run 2 - use table stats
  • 28. Benchmark run 2 - use table stats Results ● Some queries noticeably faster ● Most work still being done in memory ● Query planning is very slow Next, try: Planning phases
  • 29. Benchmark run 3 - planning phases Split planning into 2 phases (create separate Calcite Programs): 1. Logical phase: Use rule-based optimiser (Hep), apply all logical optimisation rules 2. Physical phase: Use cost-based optimiser (Volcano), apply all physical rules and conversions (jdbc, druid, …)
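The logical phase described above can be sketched roughly as follows (rule names from recent Calcite's `CoreRules`; the specific rule selection here is illustrative, not the talk's exact configuration):

```java
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public class TwoPhaseExample {
  // Phase 1 (logical): exhaustively apply cheap rewrite rules with the
  // rule-based Hep planner, so the cost-based Volcano phase starts
  // from a simplified plan and has a much smaller search space.
  public static RelNode logicalPhase(RelNode logicalPlan) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(CoreRules.FILTER_INTO_JOIN)
        .addRuleInstance(CoreRules.PROJECT_MERGE)
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(logicalPlan);
    return planner.findBestExp();
  }
  // Phase 2 (physical) would register only the conversion rules
  // (JDBC, Druid, ...) with a VolcanoPlanner and run it on the result.
}
```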
  • 30. Benchmark run 3 - planning phases vs stats
  • 31. Benchmark run 3 - planning phases vs baseline
  • 32. Benchmark run 3 - planning phases Results ● Planning and execution are much faster ● Full query push down ● Almost all query time being spent in data source ● Some queries are faster than baseline
  • 33. Conclusion - single data source The goal: full query push down Fastest way to get there: ● Apply most rules in rule-based optimiser ● Avoid giving cost-based optimiser too many rules ● Rely on underlying data source optimiser
  • 34. Federated queries “Single query based on information from multiple data sources”
  • 35. Obstacles to efficiency ● Joins of large tables ● Large driver fetchSize ● Aggregations ● Other in-memory computations Can lead to: ● Slow queries ● OOM
  • 36. Solution: use Spark as the execution engine ● Scales out to many nodes ● Avoids OOM for large datasets ● Some Calcite support Bonus: ● Eases access to data lakes
  • 37. Calcite support for Spark Currently some options available: ● Calcite Spark adapter ○ Translates Calcite plan => Spark plan ○ Sends Spark plan to cluster ○ - Needs more work to make the translation complete ● Calcite SparkSqlDialect ○ Use RelToSqlConverter to generate a query in Spark SQL ○ - Spark completely re-plans/optimises the query ○ - Spark push-downs are not as complete ● Send SQL queries directly to Spark ○ - Can't use the Calcite toolbox ○ - Spark push-downs are not as complete
  • 38. A hybrid approach 1. Generate a Calcite plan 2. Identify pieces related to separate data sources 3. Squash those pieces into “Spark tables” 4. Construct a dataframe for every “Spark table” 5. Convert the remaining plan to SparkSQL
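Steps 4-5 above can be sketched with Spark's JDBC source; the method and parameter names are hypothetical, and `t1Sql`/`t2Sql` stand for the per-source SQL strings Calcite rendered for the squashed plan fragments:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HybridExample {
  // Register each squashed per-source query as a Spark temp view,
  // then run the remaining (cross-source) plan as Spark SQL.
  public static Dataset<Row> run(SparkSession spark,
      String urlX, String t1Sql, String urlY, String t2Sql) {
    spark.read().format("jdbc")
        .option("url", urlX)                     // data source X
        .option("dbtable", "(" + t1Sql + ") t1") // push-down query
        .load()
        .createOrReplaceTempView("t1");
    spark.read().format("jdbc")
        .option("url", urlY)                     // data source Y
        .option("dbtable", "(" + t2Sql + ") t2")
        .load()
        .createOrReplaceTempView("t2");
    // The remaining Calcite plan, rendered with SparkSqlDialect:
    return spark.sql("select * from t1 join t2 using (cust_id) limit 10");
  }
}
```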
  • 39. Hybrid approach - example EnumerableLimit(fetch=10) EnumerableHashJoin JdbcToEnumerableConverter(datasource=[X]) JdbcProject(cust_id, sale_id) JdbcTableScan(table=[sales]) JdbcToEnumerableConverter(datasource=[Y]) JdbcProject(cust_id) JdbcFilter(city=[Berlin]) JdbcTableScan(table=[customer]) Join two tables from different data sources, X & Y. select cust_id, sale_id from X.sales join Y.customer using (cust_id) where Y.customer.city = 'Berlin' limit 10
  • 40. Hybrid approach - example EnumerableLimit(fetch=10) EnumerableHashJoin JdbcToEnumerableConverter(datasource=[X]) JdbcProject(cust_id, sale_id) JdbcTableScan(table=[sales]) JdbcToEnumerableConverter(datasource=[Y]) JdbcProject(cust_id) JdbcFilter(city=[Berlin]) JdbcTableScan(table=[customer]) Identify data source pieces t1: select cust_id, sale_id from X.sales
  • 41. Hybrid approach - example EnumerableLimit(fetch=10) EnumerableHashJoin JdbcToEnumerableConverter(datasource=[X]) JdbcProject(cust_id, sale_id) JdbcTableScan(table=[sales]) JdbcToEnumerableConverter(datasource=[Y]) JdbcProject(cust_id) JdbcFilter(city=[Berlin]) JdbcTableScan(table=[customer]) Identify data source pieces t2: select cust_id from Y.customer where city = 'Berlin'
  • 42. Hybrid approach - example SparkLimit(fetch=10) SparkHashJoin SparkTable(table=t1, ds=[X], query=[...]) SparkTable(table=t2, ds=[Y], query=[...]) select * from t1 join t2 using (cust_id) limit 10 Use the Spark SqlDialect to transform into a Spark query
  • 43. Conclusion - Hybrid approach ● Full scalability of Spark ● Push-down ability of Calcite ● No bottleneck through an in-memory layer ● Can be persisted to an external db / file using Spark ● Approach would also work for other engines (Drill, Flink, Presto)