SQL on Hadoop
Defining the New Generation of Analytic Databases

       Strata Conference, February 2013
Speaker Bio: Carl Steinbach


Currently:
  
Engineer at Citus Data
  
PMC Chair, Committer -- Apache Hive Project
  
@cwsteinbach on Twitter

Formerly:
  
Cloudera, Informatica, NetApp, Oracle


This is going to sound strange, but…



    I used to think
databases were boring


Why?



Undergrad at MIT 1997-2001

Number of Database Classes: 0

Number of Database Faculty Members: 0

My Conclusion: Databases are a Dead Field


Things Changed Over the Next Couple of Years


I got a job!

Database Group Formed at MIT (2003)
   
- Mike Stonebraker
   
- Sam Madden

New Class: 6.830 Database Systems (2005)


What Changed?

Web-scale Data

New DB Research: Columnar Storage, NoSQL

MPP Analytic Databases Gained Market Traction

GFS (’03) and MapReduce (’04) Papers

Apache Hadoop – v0.1.0 released in 2006

What’s Good About Hadoop?

Commodity Storage

Scale-out

Flexibility
  
MapReduce
  
Multi-structured Data




What’s Bad About Hadoop?

MapReduce!

No Schemas!

Missing Features
  
Optimizer, Indexes, Views

Incompatibility with Existing Tools
  
BI, ETL, IDEs

Apache Hive Solved Many of These Problems

[Diagram: Hive architecture]
User Client: Hive CLI; ETL, BI, SQL IDE (via Hive ODBC/JDBC)
SQL Queries flow from the User Client to HiveServer2
HiveServer2 (SQL to MapReduce): Compiler, Rule Based Optimizer, MR Plan Execution Coordinator
Catalog Metadata flows between HiveServer2 and the Hive MetaStore
Hive MetaStore: Table to Files, Table to Format
Each of three worker nodes: Map/Reduce, Hive Operators, Hive SerDes, HDFS datanode
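
As a rough illustration of the "SQL to MapReduce" step, the sketch below shows how a query such as SELECT status, COUNT(*) FROM requests GROUP BY status could be expressed as a map phase, a shuffle, and a reduce phase. This is plain Python, not Hive code and not Hive's actual plan format; the requests table, the status column, and every function name are made up for the example.

```python
from collections import defaultdict

# Hypothetical input: rows of a made-up 'requests' table, split across two blocks.
BLOCKS = [
    [{"status": 200}, {"status": 404}, {"status": 200}],
    [{"status": 500}, {"status": 200}, {"status": 404}],
]


def map_phase(block):
    """Emit one (group key, 1) pair per row, as a COUNT(*) mapper might."""
    return [(row["status"], 1) for row in block]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the final counts."""
    return {key: sum(values) for key, values in grouped.items()}


def group_by_count(blocks):
    """Run map over every block, then shuffle and reduce the combined output."""
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(group_by_count(BLOCKS))
```

Even this toy version makes three passes over the data for one aggregate, which hints at why per-query latency became a concern.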
  


But Other Problems Remained

MapReduce: Latency Overhead

Many Missing Features:
•    ANSI SQL
•    Cost Based Optimizer
•    UDFs
•    Data Types
•    Security
•    …




One Solution: Separate MPP DB Cluster

[Diagram: a separate MPP Database Cluster alongside the Hadoop Cluster]
MPP Database Cluster: MPP Master Node (Global Query Executor) over four MPP Worker Nodes (Local Query Executor each)
Hadoop Cluster: four HDFS datanodes
One Solution: Separate MPP DB Cluster

[Diagram: MPP Master Node (Global Query Executor) over four MPP Worker Nodes (Local Query Executor each); the workers pull data to the work from four HDFS datanodes, creating an IO Bottleneck]
Better Solution:
  A New Architecture for SQL on Hadoop

[Diagram: MPP Master Node (Global Query Executor) pushes work to the data; each of four nodes runs a Local Query Executor on the same machine as an HDFS datanode, maintaining data locality]
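
To make the difference between "pull data to work" and "push work to data" concrete, here is a minimal sketch that counts how many values cross the network for a simple SUM under each approach. It is a toy model in plain Python, not a measurement of any real system; the node names and amounts are invented.

```python
# Hypothetical data: each HDFS datanode holds a list of order amounts.
DATANODES = {
    "node1": [10, 20, 30],
    "node2": [5, 15],
    "node3": [40, 60, 80, 100],
}


def pull_data_to_work(datanodes):
    """Ship every row to a separate cluster, then aggregate there.

    Returns (result, number of values sent over the network).
    """
    shipped = [amount for rows in datanodes.values() for amount in rows]
    return sum(shipped), len(shipped)


def push_work_to_data(datanodes):
    """Aggregate next to the data and ship one partial sum per node.

    Returns (result, number of values sent over the network).
    """
    partials = [sum(rows) for rows in datanodes.values()]
    return sum(partials), len(partials)


if __name__ == "__main__":
    print(pull_data_to_work(DATANODES))
    print(push_work_to_data(DATANODES))
```

Both functions return the same total. In this model the pull variant ships one value per row, while the push variant ships one value per node regardless of how many rows each node stores.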
  




The New Architecture in Detail: CitusDB

[Diagram: CitusDB architecture]
Clients: PostgreSQL Tools, ODBC/JDBC Clients
CitusDB Master Node: Metadata, Distributed Query Planner, Distributed Query Executor
Hadoop HDFS NameNode: Metadata
Each of three worker nodes: Local Query Planner, Local Query Executor, Foreign Data Wrappers, HDFS datanode
The New Architecture in Detail: CitusDB

[Diagram: Metadata Sync between the CitusDB Master Node metadata and the Hadoop HDFS NameNode metadata]

Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
The New Architecture in Detail: CitusDB

[Diagram: a User Query flows from the PostgreSQL Tools and ODBC/JDBC Clients to the CitusDB Master Node]

Step 2) The user submits a SQL query to the CitusDB master node using the PostgreSQL CLI or a JDBC/ODBC app.
The New Architecture in Detail: CitusDB

[Diagram: Local Queries flow from the Distributed Query Executor on the CitusDB Master Node to the workers]

Step 3) The Master Node generates an optimized global query plan and sends fragment queries to the workers.
The New Architecture in Detail: CitusDB

[Diagram: Local Results flow from the workers back to the Distributed Query Executor on the CitusDB Master Node]

Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send partial result sets back to the Master Node.
The New Architecture in Detail: CitusDB

[Diagram: Query Results flow from the CitusDB Master Node back to the PostgreSQL Tools and ODBC/JDBC Clients]

Step 5) The Master Node merges the partial result sets and returns the final result to the user.
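
The five steps can be sketched as a small scatter-gather program. This is an illustrative model in plain Python, not CitusDB code: the Worker and Master classes, the shard ids, and the price values are all hypothetical, and the NameNode lookup in step 1 is replaced by simply asking each worker object which shards it holds.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Stands in for a worker process co-located with an HDFS datanode."""
    name: str
    shards: dict  # shard id -> list of price values stored locally

    def run_fragment(self, shard_id):
        # Step 4: compute a partial aggregate over local data only.
        prices = self.shards[shard_id]
        return {"sum": sum(prices), "count": len(prices)}


class Master:
    """Stands in for the master node (hypothetical, heavily simplified)."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}
        self.shard_locations = {}

    def sync_metadata(self):
        # Step 1: learn which worker holds which shard.
        for worker in self.workers.values():
            for shard_id in worker.shards:
                self.shard_locations[shard_id] = worker.name

    def average_price(self):
        # Step 2 is the call to this method (the user's query).
        # Step 3: plan one fragment query per shard and send it to its worker.
        partials = [
            self.workers[name].run_fragment(shard_id)
            for shard_id, name in sorted(self.shard_locations.items())
        ]
        # Step 5: merge the partial results. AVG is merged from sums and
        # counts, because averaging the per-shard averages would be wrong
        # when shards differ in size.
        total = sum(p["sum"] for p in partials)
        count = sum(p["count"] for p in partials)
        return total / count


if __name__ == "__main__":
    master = Master([
        Worker("w1", {1: [10.0, 20.0]}),
        Worker("w2", {2: [30.0]}),
        Worker("w3", {3: [40.0, 50.0, 60.0]}),
    ])
    master.sync_metadata()
    print(master.average_price())
```

The merge step is where a distributed planner has to be careful: each fragment returns a sum and a count rather than a finished average, so the master can combine them exactly.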
CitusDB: Standing on the Shoulders of Giants




PostgreSQL + Hadoop

PostgreSQL: Mature, Battle-tested; Enterprise Class Features; Has an Elephant Mascot
Hadoop: Proven Scalability; Cost Effectiveness; Has an Elephant Mascot




Leveraging PostgreSQL Performance


Cost-based Query Optimizer


postgres=# EXPLAIN SELECT
             customer.c_custkey,
             sum((lineitem.l_extendedprice * (1::numeric - lineitem.l_discount)))
             ….
 ->  Sort  (cost=282459.19..282599.52 rows=56134 width=182)
       Sort Key: customer.c_custkey, customer.c_name
       Sort Method: external merge  Disk: 17192kB
             ….
       ->  Hash Join  (cost=39666.61..257246.25 rows=56134 width=16)
             Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
             ->  Seq Scan on lineitem_102022 lineitem  (cost=0.00..190571.11)



Leveraging PostgreSQL Features:
More than 300 Built-in Functions
QUOTE_LITERAL
       REGR_SLOPE
              COS
                     GREATEST
          QUOTE_IDENT
          SET_BYTE
STRING_TO_ARRAY
     ENUM_RANGE
              EXTRACT
                 REGR_SXY
          REGR_R2
              XMLFOREST
CONVERT_TO
          NTH_VALUE
               DIV
                     OVERLAPS
          LAG
                  LAG
DATE_TRUNC
          SIN
                     BTRIM
                   FLOOR
             PI
                   FORMAT
TO_DATE
             TRANSACTION_TIMESTAMP
   LOWER
                   SQRT
              TRUNC
                ARRAY_AGG
LOWER_INC
           REGR_SYY
                CONCAT
                  RTRIM
             STRIP
                LTRIM
CHAR_LENGTH
         IS FALSE
                ARRAY_FILL
              REGR_AVGY
         XMLAGG
               BETWEEN
CURRENT_TIMESTAMP
   BROADCAST
               JUSTIFY_DAYS
            IS DISTINCT
       UPPER
                BOX
ARRAY_LENGTH
        ISCLOSED
                VAR_POP
                 TIMEOFDAY
         COVAR_POP
            CURRVAL
REPEAT
              VAR_SAMP
                OCTET_LENGTH
            LN
                NETMASK
              LOCALTIME
UPPER
               QUERY_TO_XML
            STATEMENT_TIMESTAMP
     TO_CHAR
           FIRST_VALUE
          LPAD
CASE
                GET_BIT
                 TAN
                     TRUNC
             LOWER_INF
            REGR_AVGX
BOOL_AND
            IS NOT UNKNOWN
          ARRAY_APPEND
            ISNULL
            REGR_COUNT
           DATE_PART
CORR
                ENUM_LAST
               XMLCOMMENT
              SCHEMA_TO_XML
     SET_MASKLEN
          ARRAY_TO_STRING
XPATH_EXISTS
        NUMNODE
                 REGEXP_MATCHES
          COALESCE
          NOW
                  EXTRACT
RADIUS
              SPLIT_PART
              CONVERT_FROM
            ENUM_FIRST
        ISOPEN
               UPPER_INC
MOD
                 REPLACE
                 XPATH
                   BIT_AND
           REGR_COUNT
           TRANSLATE
AREA
                EVERY
                   AT TIME ZONE
            RADIANS
           NOW
                  SQRT
ATAN2
               IS TRUE
                 RANDOM
                  SUM
               MIN
                  NOT LIKE
REGEXP_REPLACE
      RPAD
                    CEILING
                 TRIM
              TO_HEX
               LOG
DECODE
              NOW
                     WIDTH
                   STDDEV_POP
        GET_BYTE
             DATE_TRUNC
BOOL_OR
             REGR_SXX
                ROUND
                   LSEG
              XML_IS_WELL_FORMED
   VARIANCE
CUME_DIST
           PATH
                    COVAR_SAMP
              STRING_AGG
        LASTVAL
              UNNEST
OVERLAY
             PERCENT_RANK
            HOSTMASK
                PCLOSE
            HEIGHT
               ANY
POINT
               IN
                      ARRAY_DIMS
              MASKLEN
           DENSE_RANK
           LOCALTIMESTAMP
JUSTIFY_INTERVAL
    CURRENT_DATE
            CURSOR_TO_XML
           LIKE
              SETVAL
               LENGTH
POWER
               UPPER_INF
               GENERATE_SUBSCRIPTS
     POSITION
          LAST_VALUE
           INITCAP
IS NOT TRUE
         XMLAGG
                  PG_SLEEP
                VAR_POP
           STRPOS
               SIGN
FORMAT
              GENERATE_SERIES
         STDDEV_SAMP
             DENSE_RANK
        COT
                  SUBSTR
REVERSE
             REGR_INTERCEPT
          SIMILAR TO
              DATABASE_TO_XML
   ARRAY_CAT
            STDDEV
IS NOT FALSE
        DIAMETER
                NOTNULL
                 HOST
              TO_ASCII
             ABS
ROW_TO_JSON
         ROW_NUMBER
              SUBSTRING
               SETSEED
           ISFINITE
             SOME
SET_BIT
             ARRAY_NDIMS
             REGEXP_SPLIT_TO_ARRAY
   TO_TIMESTAMP
      NOT
                  MD5



Leveraging PostgreSQL Features

Extensible, Rich Type System

Pluggable Format Handlers

Security

Internationalization

Connectivity: ODBC, JDBC

Ecosystem Add-Ons:
    PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc.)

Where are We Headed?
Distributed. SQL. Anywhere.
[Diagram: CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor) over three workers, each with a Local Query Planner, Local Query Executor, and Foreign Data Wrapper; the wrappers sit in front of HDFS on a Hadoop Datanode, mongod on a MongoDB Shard, and an RDBMS server]
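
The idea behind putting a wrapper in front of each storage system can be sketched as a shared scan interface: the local executor only sees rows, whatever the backend is. The example below is a loose analogy in plain Python, not the PostgreSQL foreign data wrapper API; the class names, the scan() method, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import json


class CsvWrapper:
    """Hypothetical wrapper for delimited text, e.g. a file stored in HDFS."""

    def __init__(self, text, columns):
        self.text = text
        self.columns = columns

    def scan(self):
        # Attach column names to each parsed line at read time.
        for fields in csv.reader(io.StringIO(self.text)):
            yield dict(zip(self.columns, fields))


class JsonLinesWrapper:
    """Hypothetical wrapper for JSON documents, e.g. from a document store."""

    def __init__(self, text):
        self.text = text

    def scan(self):
        # One JSON object per line.
        for line in self.text.splitlines():
            yield json.loads(line)


def count_where(wrapper, column, value):
    """A stand-in 'local executor' that depends only on the scan() interface."""
    return sum(1 for row in wrapper.scan() if str(row[column]) == value)


if __name__ == "__main__":
    csv_source = CsvWrapper("1,US\n2,DE\n3,US\n", ["id", "country"])
    json_source = JsonLinesWrapper(
        '{"id": 4, "country": "US"}\n{"id": 5, "country": "FR"}'
    )
    total = sum(count_where(s, "country", "US") for s in (csv_source, json_source))
    print(total)
```

Because count_where never inspects the storage format, adding another backend in this sketch means writing one more class with a scan() method and nothing else.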
  


Defining the New Generation of 
Distributed Analytic Databases


SQL à Ease of Use, Increased Productivity

Real-time responsiveness à Faster

Data Locality à Proven Scalability

Schema-on-Read à Flexibility, Lower Cost
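
Schema-on-read means the files are stored as-is and a schema is applied only when a query reads them. A minimal sketch of that idea follows, assuming pipe-delimited text lines; the sample lines, the column names, and the helper functions are invented for the example and do not come from any particular system.

```python
from datetime import date

# Raw lines as they might sit in a file; nothing was validated at load time.
RAW_LINES = [
    "2013-02-26|checkout|19.50",
    "2013-02-27|search|",
    "2013-02-27|checkout|5.00",
]

# The schema is supplied by the reader, not by the storage layer.
SCHEMA = [
    ("day", date.fromisoformat),
    ("event", str),
    ("amount", lambda s: float(s) if s else None),
]


def read_with_schema(lines, schema, delimiter="|"):
    """Parse each raw line into a typed row at query time."""
    for line in lines:
        fields = line.split(delimiter)
        yield {
            name: convert(field)
            for (name, convert), field in zip(schema, fields)
        }


def total_checkout_amount(lines):
    """Sum the amount column over rows whose event is 'checkout'."""
    rows = read_with_schema(lines, SCHEMA)
    return sum(r["amount"] for r in rows if r["event"] == "checkout")


if __name__ == "__main__":
    print(total_checkout_amount(RAW_LINES))
```

A different reader could apply a different SCHEMA to the same lines without rewriting the stored data, which is where the flexibility comes from.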



Where Are We At?


CitusDB SQL on Hadoop is in Open Beta

Download our Binary Packages

Or Use Our EC2 AMI

    http://citusdata.com/docs/sql-on-hadoop
                        

We’re Hiring
http://citusdata.com/job


For questions and more information:
        info@citusdata.com
           (650) 566-9010





More Related Content

What's hot

Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handyPraveen Sripati
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Nicolas Morales
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentationMapR Technologies
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 

What's hot (18)

Apache Drill
Apache DrillApache Drill
Apache Drill
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases

  • 1. SQL on Hadoop: Defining the New Generation of Analytic Databases. Strata Conference, February 2013
  • 2. Speaker Bio: Carl Steinbach. Currently: Engineer at Citus Data; PMC Chair, Committer, Apache Hive Project; @cwsteinbach on Twitter. Formerly: Cloudera, Informatica, NetApp, Oracle
  • 3. This is going to sound strange, but… I used to think databases were boring
  • 4. Why? Undergrad at MIT 1997-2001. Number of Database Classes: 0. Number of Database Faculty Members: 0. My Conclusion: Databases are a Dead Field
  • 5. Things Changed Over the Next Couple of Years. I got a job! Database Group Formed at MIT (2003): Mike Stonebraker, Sam Madden. New Class: 6.830 Database Systems (2005)
  • 6. What Changed? Web-scale Data. New DB Research: Columnar Storage, NoSQL. MPP Analytic Databases Gained Market Traction. GFS (’03) and MapReduce (’04) Papers. Apache Hadoop: v0.1.0 released in 2006
  • 7. What’s Good About Hadoop? Commodity Storage. Scale-out. Flexibility: MapReduce, Multi-structured Data
  • 8. What’s Bad About Hadoop? MapReduce! No Schemas! Missing Features: Optimizer, Indexes, Views. Incompatibility with Existing Tools: BI, ETL, IDEs
  • 9. Apache Hive Solved Many of These Problems. [Diagram: a User Client (Hive CLI; ETL, BI, SQL IDE; Hive ODBC/JDBC) sends SQL Queries to HiveServer2 (SQL to MapReduce: Compiler, Rule Based Optimizer, MR Plan Execution Coordinator), which exchanges Catalog Metadata with the Hive MetaStore (Table to Files, Table to Format); three Map/Reduce tasks, each with Hive Operators and Hive SerDes, run over three HDFS datanodes.]
  • 10. But Other Problems Remained. MapReduce: Latency Overhead. Many Missing Features: ANSI SQL, Cost Based Optimizer, UDFs, Data Types, Security, …
  • 11. One Solution: Separate MPP DB Cluster. [Diagram: an MPP Database Cluster (an MPP Master Node with a Global Query Executor, plus four MPP Worker Nodes, each with a Local Query Executor) next to a separate Hadoop Cluster of four HDFS datanodes.]
  • 12. One Solution: Separate MPP DB Cluster. [Diagram: the same MPP Master Node and four MPP Worker Nodes, annotated "Pull Data to Work" and "IO Bottleneck" between the worker nodes and the four HDFS datanodes.]
  • 13. Better Solution: A New Architecture for SQL on Hadoop. [Diagram: an MPP Master Node with a Global Query Executor, annotated "Push Work to Data"; four Local Query Executors paired with four HDFS datanodes, annotated "Maintain Data Locality".]
  • 14. The New Architecture in Detail: CitusDB. [Diagram: a CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor) connected on one side to the HDFS NameNode (Hadoop Metadata) and on the other to clients (PostgreSQL Tools, ODBC/JDBC Clients); below it, three workers, each with a Local Query Planner, Local Query Executor, and Foreign Data Wrappers on top of an HDFS datanode.]
  • 15. The New Architecture in Detail: CitusDB. [Same diagram, highlighting "Metadata Sync".] Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
  • 16. The New Architecture in Detail: CitusDB. [Same diagram, highlighting "User Query".] Step 2) The user submits a SQL query to the CitusDB master node using the PostgreSQL CLI or a JDBC/ODBC app.
  • 17. The New Architecture in Detail: CitusDB. [Same diagram, highlighting "Local Queries".] Step 3) The Master Node generates an optimized global query plan and sends fragment queries to the workers.
  • 18. The New Architecture in Detail: CitusDB. [Same diagram, highlighting "Local Results".] Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send partial result sets back to the Master Node.
  • 19. The New Architecture in Detail: CitusDB. [Same diagram, highlighting "Query Results".] Step 5) The Master Node merges the partial result sets and returns the final result to the user.
  • 20. CitusDB: Standing on the Shoulders of Giants + Mature, Battle-tested; Proven Scalability; Enterprise Class Features; Cost Effectiveness; Has an Elephant Mascot; Has an Elephant Mascot
  • 21. Leveraging PostgreSQL Performance: Cost-based Query Optimizer
        postgres=# EXPLAIN SELECT customer.c_custkey, sum((lineitem.l_extendedprice * (1::numeric - lineitem.l_discount)))
        ….
        -> Sort (cost=282459.19..282599.52 rows=56134 width=182)
              Sort Key: customer.c_custkey, customer.c_name
              Sort Method: external merge Disk: 17192kB
        ….
        -> Hash Join (cost=39666.61..257246.25 rows=56134 width=16)
              Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
              -> Seq Scan on lineitem_102022 lineitem (cost=0.00..190571.11)
  • 22. Leveraging PostgreSQL Features: More than 300 Built-in Functions: QUOTE_LITERAL REGR_SLOPE COS GREATEST QUOTE_IDENT SET_BYTE STRING_TO_ARRAY ENUM_RANGE EXTRACT REGR_SXY REGR_R2 XMLFOREST CONVERT_TO NTH_VALUE DIV OVERLAPS LAG LAG DATE_TRUNC SIN BTRIM FLOOR PI FORMAT TO_DATE TRANSACTION_TIMESTAMP LOWER SQRT TRUNC ARRAY_AGG LOWER_INC REGR_SYY CONCAT RTRIM STRIP LTRIM CHAR_LENGTH IS FALSE ARRAY_FILL REGR_AVGY XMLAGG BETWEEN CURRENT_TIMESTAMP BROADCAST JUSTIFY_DAYS IS DISTINCT UPPER BOX ARRAY_LENGTH ISCLOSED VAR_POP TIMEOFDAY COVAR_POP CURRVAL REPEAT VAR_SAMP OCTET_LENGTH LN NETMASK LOCALTIME UPPER QUERY_TO_XML STATEMENT_TIMESTAMP TO_CHAR FIRST_VALUE LPAD CASE GET_BIT TAN TRUNC LOWER_INF REGR_AVGX BOOL_AND IS NOT UNKNOWN ARRAY_APPEND ISNULL REGR_COUNT DATE_PART CORR ENUM_LAST XMLCOMMENT SCHEMA_TO_XML SET_MASKLEN ARRAY_TO_STRING XPATH_EXISTS NUMNODE REGEXP_MATCHES COALESCE NOW EXTRACT RADIUS SPLIT_PART CONVERT_FROM ENUM_FIRST ISOPEN UPPER_INC MOD REPLACE XPATH BIT_AND REGR_COUNT TRANSLATE AREA EVERY AT TIME ZONE RADIANS NOW SQRT ATAN2 IS TRUE RANDOM SUM MIN NOT LIKE REGEXP_REPLACE RPAD CEILING TRIM TO_HEX LOG DECODE NOW WIDTH STDDEV_POP GET_BYTE DATE_TRUNC BOOL_OR REGR_SXX ROUND LSEG XML_IS_WELL_FORMED VARIANCE CUME_DIST PATH COVAR_SAMP STRING_AGG LASTVAL UNNEST OVERLAY PERCENT_RANK HOSTMASK PCLOSE HEIGHT ANY POINT IN ARRAY_DIMS MASKLEN DENSE_RANK LOCALTIMESTAMP JUSTIFY_INTERVAL CURRENT_DATE CURSOR_TO_XML LIKE SETVAL LENGTH POWER UPPER_INF GENERATE_SUBSCRIPTS POSITION LAST_VALUE INITCAP IS NOT TRUE XMLAGG PG_SLEEP VAR_POP STRPOS SIGN FORMAT GENERATE_SERIES STDDEV_SAMP DENSE_RANK COT SUBSTR REVERSE REGR_INTERCEPT SIMILAR TO DATABASE_TO_XML ARRAY_CAT STDDEV IS NOT FALSE DIAMETER NOTNULL HOST TO_ASCII ABS ROW_TO_JSON ROW_NUMBER SUBSTRING SETSEED ISFINITE SOME SET_BIT ARRAY_NDIMS REGEXP_SPLIT_TO_ARRAY TO_TIMESTAMP NOT MD5
  • 23. Leveraging PostgreSQL Features: Extensible, Rich Type System; Pluggable Format Handlers; Security; Internationalization; Connectivity: ODBC, JDBC; Ecosystem Add-Ons: PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc.)
  • 24. Where are We Headed? Distributed. SQL. Anywhere. [Diagram: a CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor) above three workers, each with a Local Query Planner, Local Query Executor, and Foreign Data Wrapper; the three workers sit on HDFS (Hadoop Datanode), mongod (MongoDB Shard), and RDBMS (RDBMS server) respectively.]
  • 25. Defining the New Generation of Distributed Analytic Databases. SQL → Ease of Use, Increased Productivity. Real-time responsiveness → Faster. Data Locality → Proven Scalability. Schema-on-Read → Flexibility, Lower Cost
  • 26. Where Are We At? CitusDB SQL on Hadoop is in Open Beta. Download our Binary Packages or Use Our EC2 AMI: http://citusdata.com/docs/sql-on-hadoop
  • 28. For questions and more information: info@citusdata.com, (650) 566-9010
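
The push-work-to-data flow described in Steps 3 to 5 (slides 17 to 19) can be illustrated with a small scatter/gather sketch. This is not CitusDB code: the worker names, the in-memory rows standing in for HDFS blocks, and the functions `run_fragment`, `merge_partials`, and `run_query` are all hypothetical, chosen only to show why each worker can aggregate its local data and return a small partial result for the master to merge.

```python
from collections import defaultdict

# Hypothetical stand-in for data locality: each worker only sees the rows
# stored on its own datanode (customer, order amount).
WORKER_BLOCKS = {
    "worker-1": [("alice", 10.0), ("bob", 5.0)],
    "worker-2": [("alice", 2.5), ("carol", 7.0)],
    "worker-3": [("bob", 1.0)],
}


def run_fragment(rows):
    """Step 4 (sketch): a worker aggregates its local rows into a partial result."""
    partial = defaultdict(float)
    for customer, amount in rows:
        partial[customer] += amount
    return dict(partial)


def merge_partials(partials):
    """Step 5 (sketch): the master merges the partial sums into the final result."""
    final = defaultdict(float)
    for partial in partials:
        for customer, subtotal in partial.items():
            final[customer] += subtotal
    return dict(final)


def run_query(worker_blocks):
    """Steps 3-5 (sketch): one fragment per worker, then a merge on the master."""
    partials = [run_fragment(rows) for rows in worker_blocks.values()]
    return merge_partials(partials)


if __name__ == "__main__":
    print(run_query(WORKER_BLOCKS))
```

In this toy version only the per-customer subtotals travel back to the master, never the raw rows, which is the contrast with the "Pull Data to Work" design on slide 12. A real system would run the fragments in parallel on separate machines and would need more careful merge logic for aggregates such as averages.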