SlideShare a Scribd company logo
1 of 19
Download to read offline
Impala: A Modern SQL
  Engine for Hadoop
     Marcel Kornacker
      Cloudera, Inc.
       12/03/2012
Impala Overview: Goals
● General-purpose SQL query engine:
    ○ should work both for analytic and transactional
       workloads
    ○ will support queries that take from microseconds to
       hours
●   Runs directly within Hadoop:
    ○ reads widely used Hadoop file formats
    ○ talks to widely used Hadoop storage managers
    ○ runs on same nodes that run Hadoop processes
●   High performance:
    ○ C++ instead of Java
    ○ runtime code generation
    ○ completely new execution engine that doesn't build
       on MapReduce
User View of Impala: Overview
● Runs as a distributed service in cluster: one Impala
    daemon on each node with data
●   User submits query via ODBC/Beeswax Thrift API to
    any of the daemons
●   Query is distributed to all nodes with relevant data
●   If any node fails, the query fails
●   Impala uses Hive's metadata interface, connects to
    Hive's metastore
●   Supported file formats:
     ○ text files (GA: with compression, including lzo)
     ○ sequence files with snappy/gzip compression
     ○ GA: Avro data files
     ○ GA: Trevni (columnar format; more on that later)
User View of Impala: SQL
● SQL support:
    ○ patterned after Hive's version of SQL
    ○ limited to Select, Project, Join, Union, Subqueries,
       Aggregation and Insert
    ○ Order By only with Limit
    ○ GA: DDL support (CREATE, ALTER)
●   Functional limitations:
    ○ no custom UDFs, file formats, SerDes
    ○ no beyond SQL (buckets, samples, transforms,
       arrays, structs, maps, xpath, json)
    ○ only hash joins; joined table has to fit in memory:
       ■ beta: of single node
       ■ GA: aggregate memory of all (executing) nodes
    ○ beta: join order = FROM clause order
    ○ GA: rudimentary cost-based optimizer
User View of Impala: HBase
● HBase functionality:
    ○ uses Hive's mapping of HBase table into metastore
      table
    ○ predicates on rowkey columns are mapped into
      start/stop row
    ○ predicates on other columns are mapped into
      SingleColumnValueFilters
●   HBase functional limitations:
    ○ no nested-loop joins
    ○ all data stored as text
Impala Architecture
● Two binaries: impalad and statestored
● Impala daemon (impalad)
    a. handles client requests and all internal requests
       related to query execution
    b. exports Thrift services for these two roles
●   State store daemon (statestored)
    a. provides name service and metadata distribution
    b. also exports a Thrift service
Impala Architecture
● Query execution phases
   ○ request arrives via odbc/beeswax Thrift API
   ○ planner turns request into collections of plan
     fragments
   ○ coordinator initiates execution on remote impalad's
   ○ during execution
      ■ intermediate results are streamed between
         executors
      ■ query results are streamed back to client
      ■ subject to limitations imposed to blocking
         operators (top-n, aggregation)
Impala Architecture
● Planner:
   ○ join order = FROM clause order
     GA target: rudimentary cost-based optimizer
   ○ plan operators: Scan, HashJoin, HashAggregation,
     Union, TopN, Exchange
   ○ distributed aggregation: pre-aggregation in all nodes,
     merge aggregation in single node
     GA: hash-partitioned aggregation
   ○ partitions plans along scan boundaries
Impala Architecture
● Example: query with join and aggregation
   SELECT state, SUM(revenue)
   FROM HdfsTbl h JOIN HbaseTbl b ON (...)
   GROUP BY 1 ORDER BY 2 desc LIMIT 10

    TopN
                                                Agg
                          TopN
        Agg                                     Hash
                           Agg                  Join
        Hash
        Join
                                         Hdfs                     Hbase
                          Exch                         Exch
                                         Scan                     Scan
 Hdfs          Hbase   at coordinator   at DataNodes          at region servers
 Scan          Scan
Impala Architecture: Query Execution
Request arrives via odbc/beeswax Thrift API


     SQL App                             Hive
                                                    HDFS NN       Statestore
      ODBC                             Metastore


                  SQL request



     Query Planner                Query Planner           Query Planner
    Query Coordinator           Query Coordinator       Query Coordinator
     Query Executor              Query Executor          Query Executor

   HDFS DN      HBase           HDFS DN     HBase       HDFS DN       HBase
Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalad's

     SQL App                     Hive
                                            HDFS NN       Statestore
      ODBC                     Metastore




     Query Planner        Query Planner           Query Planner
    Query Coordinator   Query Coordinator       Query Coordinator
     Query Executor      Query Executor          Query Executor

   HDFS DN      HBase   HDFS DN     HBase       HDFS DN       HBase
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's
Query results are streamed back to client

     SQL App                         Hive
                                                HDFS NN     Statestore
      ODBC                         Metastore

                   query
                  results


     Query Planner            Query Planner           Query Planner
    Query Coordinator       Query Coordinator       Query Coordinator
     Query Executor          Query Executor          Query Executor

   HDFS DN      HBase       HDFS DN     HBase       HDFS DN     HBase
Impala Architecture
● Metadata handling:
   ○ utilizes Hive's metastore
   ○ caches metadata: no synchronous metastore API
     calls during query execution
   ○ beta: impalad's read metadata from metastore at
     startup
   ○ GA: metadata distribution through statestore
Impala Architecture
● Execution engine
   ○ written in C++
   ○ runtime code generation for "big loops"
      ■ example: insert batch of rows into hash table
      ■ code generation with llvm
      ■ inlines all expressions; no function calls inside
         loop
   ○ all data is copied into canonical in-memory tuple
     format; all fixed-width data is located at fixed offsets
   ○ uses intrinsics/special cpu instructions for text
     parsing, crc32 computation, etc.
Impala's Statestore
● Central system state repository
    ○ name service (membership)
    ○ GA: metadata
    ○ GA: other scheduling-relevant or diagnostic state
●   Soft-state
    ○ all impalad's register at startup
    ○ impalad's re-register after losing connection
    ○ Impala service continues to function in absence of
       statestored (but: with increasingly stale state)
    ○ State pushed to impalad's periodically
    ○ Repeated failed heartbeat means impalad evicted
       from cluster view
●   Thrift API for service / subscription registration
Comparing Impala to Dremel
● What is Dremel:
     ○ columnar storage for data with nested structures
     ○ distributed scalable aggregation on top of that
●   Columnar storage in Hadoop: Trevni
     ○ new columnar format created by Doug Cutting
     ○ stores data in appropriate native/binary types
     ○ can also store nested structures similar to Dremel's
        ColumnIO
●   Distributed aggregation: Impala
●   Impala plus Trevni: a superset of the published version
    of Dremel (which didn't support joins)
Comparing Impala to Hive
● Hive: MapReduce as an execution engine
     ○ High latency, low throughput queries
     ○ Fault-tolerance model based on MapReduce's on-
       disk checkpointing; materializes all intermediate
       results
     ○ Java runtime allows for easy late-binding of
       functionality: file formats and UDFs.
     ○ Extensive layering imposes high runtime overhead
●   Impala:
     ○ direct, process-to-process data exchange
     ○ no fault tolerance
     ○ an execution engine designed for low runtime
       overhead
Comparing Impala to Hive
● Impala's performance advantage over Hive: no hard
   numbers, but
   ○ Impala can get full disk throughput
     (~100MB/sec/disk); I/O-bound workloads often faster
     by 3-4x
   ○ queries that require multiple map-reduce phases in
     Hive see a higher speedup
   ○ queries that run against in-memory data see a
     higher speedup (observed up to 100x)
Try it out!
●   Beta version available since 10/24/12
●   We're targeting GA for end of Q1 2013
●   Questions/comments? impala-user@cloudera.org
●   My email address: marcel@cloudera.com

More Related Content

What's hot

Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
How Impala Works
How Impala WorksHow Impala Works
How Impala WorksYue Chen
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Scott Leberknight
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaJason Shih
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLliuknag
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impalaNAVER D2
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopMukund Babbar
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 

What's hot (20)

Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impala
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on Hadoop
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 

Similar to An Introduction to Impala – Low Latency Queries for Apache Hadoop

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentationmarkgrover
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Architectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopArchitectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopSpagoWorld
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.Roman Nikitchenko
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
HBase, dances on the elephant back.
HBase, dances on the elephant back.HBase, dances on the elephant back.
HBase, dances on the elephant back.Roman Nikitchenko
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统yongboy
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 

Similar to An Introduction to Impala – Low Latency Queries for Apache Hadoop (20)

Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
Cloudera Impala presentation
Cloudera Impala presentationCloudera Impala presentation
Cloudera Impala presentation
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Architectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopArchitectural Evolution Starting from Hadoop
Architectural Evolution Starting from Hadoop
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.HBase, crazy dances on the elephant back.
HBase, crazy dances on the elephant back.
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
HBase, dances on the elephant back.
HBase, dances on the elephant back.HBase, dances on the elephant back.
HBase, dances on the elephant back.
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 

More from Chicago Hadoop Users Group

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChicago Hadoop Users Group
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieChicago Hadoop Users Group
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Chicago Hadoop Users Group
 

More from Chicago Hadoop Users Group (19)

Kinetica master chug_9.12
Kinetica master chug_9.12Kinetica master chug_9.12
Kinetica master chug_9.12
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Meet Spark
Meet SparkMeet Spark
Meet Spark
 
Choosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your BusinessChoosing the Right Big Data Architecture for your Business
Choosing the Right Big Data Architecture for your Business
 
An Overview of Ambari
An Overview of AmbariAn Overview of Ambari
An Overview of Ambari
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Advanced Oozie
Advanced OozieAdvanced Oozie
Advanced Oozie
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Financial Data Analytics with Hadoop
Financial Data Analytics with HadoopFinancial Data Analytics with Hadoop
Financial Data Analytics with Hadoop
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604Map Reduce v2 and YARN - CHUG - 20120604
Map Reduce v2 and YARN - CHUG - 20120604
 
Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416Hadoop in a Windows Shop - CHUG - 20120416
Hadoop in a Windows Shop - CHUG - 20120416
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 

Recently uploaded

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 

Recently uploaded (20)

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 

An Introduction to Impala – Low Latency Queries for Apache Hadoop

  • 1. Impala: A Modern SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc. 12/03/2012
  • 2. Impala Overview: Goals ● General-purpose SQL query engine: ○ should work both for analytic and transactional workloads ○ will support queries that take from microseconds to hours ● Runs directly within Hadoop: ○ reads widely used Hadoop file formats ○ talks to widely used Hadoop storage managers ○ runs on same nodes that run Hadoop processes ● High performance: ○ C++ instead of Java ○ runtime code generation ○ completely new execution engine that doesn't build on MapReduce
  • 3. User View of Impala: Overview ● Runs as a distributed service in cluster: one Impala daemon on each node with data ● User submits query via ODBC/Beeswax Thrift API to any of the daemons ● Query is distributed to all nodes with relevant data ● If any node fails, the query fails ● Impala uses Hive's metadata interface, connects to Hive's metastore ● Supported file formats: ○ text files (GA: with compression, including lzo) ○ sequence files with snappy/gzip compression ○ GA: Avro data files ○ GA: Trevni (columnar format; more on that later)
  • 4. User View of Impala: SQL ● SQL support: ○ patterned after Hive's version of SQL ○ limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert ○ Order By only with Limit ○ GA: DDL support (CREATE, ALTER) ● Functional limitations: ○ no custom UDFs, file formats, SerDes ○ no beyond SQL (buckets, samples, transforms, arrays, structs, maps, xpath, json) ○ only hash joins; joined table has to fit in memory: ■ beta: of single node ■ GA: aggregate memory of all (executing) nodes ○ beta: join order = FROM clause order ○ GA: rudimentary cost-based optimizer
  • 5. User View of Impala: HBase ● HBase functionality: ○ uses Hive's mapping of HBase table into metastore table ○ predicates on rowkey columns are mapped into start/stop row ○ predicates on other columns are mapped into SingleColumnValueFilters ● HBase functional limitations: ○ no nested-loop joins ○ all data stored as text
  • 6. Impala Architecture ● Two binaries: impalad and statestored ● Impala daemon (impalad) a. handles client requests and all internal requests related to query execution b. exports Thrift services for these two roles ● State store daemon (statestored) a. provides name service and metadata distribution b. also exports a Thrift service
  • 7. Impala Architecture ● Query execution phases ○ request arrives via odbc/beeswax Thrift API ○ planner turns request into collections of plan fragments ○ coordinator initiates execution on remote impalad's ○ during execution ■ intermediate results are streamed between executors ■ query results are streamed back to client ■ subject to limitations imposed to blocking operators (top-n, aggregation)
  • 8. Impala Architecture ● Planner: ○ join order = FROM clause order GA target: rudimentary cost-based optimizer ○ plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange ○ distributed aggregation: pre-aggregation in all nodes, merge aggregation in single node GA: hash-partitioned aggregation ○ partitions plans along scan boundaries
  • 9. Impala Architecture ● Example: query with join and aggregation SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 TopN Agg TopN Agg Hash Agg Join Hash Join Hdfs Hbase Exch Exch Scan Scan Hdfs Hbase at coordinator at DataNodes at region servers Scan Scan
  • 10. Impala Architecture: Query Execution Request arrives via odbc/beeswax Thrift API SQL App Hive HDFS NN Statestore ODBC Metastore SQL request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 11. Impala Architecture: Query Execution Planner turns request into collections of plan fragments Coordinator initiates execution on remote impalad's SQL App Hive HDFS NN Statestore ODBC Metastore Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 12. Impala Architecture: Query Execution Intermediate results are streamed between impalad's Query results are streamed back to client SQL App Hive HDFS NN Statestore ODBC Metastore query results Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Executor Query Executor Query Executor HDFS DN HBase HDFS DN HBase HDFS DN HBase
  • 13. Impala Architecture ● Metadata handling: ○ utilizes Hive's metastore ○ caches metadata: no synchronous metastore API calls during query execution ○ beta: impalad's read metadata from metastore at startup ○ GA: metadata distribution through statestore
  • 14. Impala Architecture ● Execution engine ○ written in C++ ○ runtime code generation for "big loops" ■ example: insert batch of rows into hash table ■ code generation with llvm ■ inlines all expressions; no function calls inside loop ○ all data is copied into canonical in-memory tuple format; all fixed-width data is located at fixed offsets ○ uses intrinsics/special cpu instructions for text parsing, crc32 computation, etc.
  • 15. Impala's Statestore ● Central system state repository ○ name service (membership) ○ GA: metadata ○ GA: other scheduling-relevant or diagnostic state ● Soft-state ○ all impalad's register at startup ○ impalad's re-register after losing connection ○ Impala service continues to function in absence of statestored (but: with increasingly stale state) ○ State pushed to impalad's periodically ○ Repeated failed heartbeat means impalad evicted from cluster view ● Thrift API for service / subscription registration
  • 16. Comparing Impala to Dremel ● What is Dremel: ○ columnar storage for data with nested structures ○ distributed scalable aggregation on top of that ● Columnar storage in Hadoop: Trevni ○ new columnar format created by Doug Cutting ○ stores data in appropriate native/binary types ○ can also store nested structures similar to Dremel's ColumnIO ● Distributed aggregation: Impala ● Impala plus Trevni: a superset of the published version of Dremel (which didn't support joins)
  • 17. Comparing Impala to Hive ● Hive: MapReduce as an execution engine ○ High latency, low throughput queries ○ Fault-tolerance model based on MapReduce's on- disk checkpointing; materializes all intermediate results ○ Java runtime allows for easy late-binding of functionality: file formats and UDFs. ○ Extensive layering imposes high runtime overhead ● Impala: ○ direct, process-to-process data exchange ○ no fault tolerance ○ an execution engine designed for low runtime overhead
  • 18. Comparing Impala to Hive ● Impala's performance advantage over Hive: no hard numbers, but ○ Impala can get full disk throughput (~100MB/sec/disk); I/O-bound workloads often faster by 3-4x ○ queries that require multiple map-reduce phases in Hive see a higher speedup ○ queries that run against in-memory data see a higher speedup (observed up to 100x)
  • 19. Try it out! ● Beta version available since 10/24/12 ● We're targeting GA for end of Q1 2013 ● Questions/comments? impala-user@cloudera.org ● My email address: marcel@cloudera.com