Parquet Update / UDFs in Impala

Nong Li
Software Engineer, Cloudera
Agenda
• Parquet
  • File format description
  • Benchmark Results in Impala
  • Parquet 2.0
• UDF/UDAs
Parquet
Data Pages
• Values are stored in data pages as a triple: Definition Level, Repetition Level and Value.
• These are stored contiguously on disk => 1 seek to read a column regardless of nesting.
• Data pages are stored with different encodings:
  • Bit packing and Run Length Encoding (RLE)
  • Dictionary for strings
    • Extended to all types in Parquet 1.1
  • Plain (little-endian encoding) for native types.
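The run-length idea behind the RLE encoding above can be illustrated with a toy encoder. Parquet's real encoding is a hybrid of RLE and bit packing, so this is only a sketch of the core principle (function name is hypothetical):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Toy run-length encoder: collapses repeated values into (value, count)
// pairs. Repetition/definition levels tend to repeat heavily, which is
// exactly what makes RLE effective for them.
std::vector<std::pair<uint32_t, uint32_t>> RleEncode(
    const std::vector<uint32_t>& vals) {
  std::vector<std::pair<uint32_t, uint32_t>> runs;
  for (uint32_t v : vals) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;  // extend the current run
    } else {
      runs.emplace_back(v, 1);  // start a new run
    }
  }
  return runs;
}
```

For a flat, fully-defined column the definition levels are all identical, so the whole page's levels collapse to a single (value, count) pair.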
Parquet 2.0
• Additional Encodings
  • Group VarInt (for small ints)
  • Improved string storage format
  • Delta Encoding (for strings and ints)
• Additional Metadata
  • Sorted files
  • Page/Column/File Statistics
• Expected to further reduce on-disk size and allow for skipping values on the read path.
Hardware Setup
• 10 Nodes
  • 16-core Xeon
  • 48 GB RAM
  • 12 Disks
• CDH 4.3
• Impala 1.1
TPC-H lineitem table @ 1TB scale factor
[Chart: on-disk Size (GB) for Text, Text w/ LZO, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, Parquet w/ Snappy, Seq w/ Gzip]
Query Times on TPC-H lineitem table
[Chart: query times scanning 1 Column, 3 Columns, 5 Columns, 16 (all) Columns, 5 Columns with 3 Clients, TPC-H Q1 (7 Columns), and Bytes Read for Q1 (GB) — comparing Text, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, Parquet w/ Snappy]
Query Times on TPCDS Queries
[Chart: query times in seconds for Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, Q96 — comparing Text, Seq w/ Snappy, RC w/ Snappy, Parquet w/ Snappy]

Average Times (Geometric Mean)
• Text: 224 seconds
• Seq Snappy: 257 seconds
• RC Snappy: 150 seconds
• Parquet: 61 seconds
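The geometric mean used for these averages is the n-th root of the product of the per-query times; unlike an arithmetic mean, it is not dominated by a single long-running query. A minimal sketch of the computation (function name hypothetical):

```cpp
#include <cmath>
#include <vector>

// Geometric mean via logs: exp(mean(log(x))) == nth root of the product,
// computed this way to avoid overflow when multiplying many values.
double GeometricMean(const std::vector<double>& times) {
  double log_sum = 0.0;
  for (double t : times) log_sum += std::log(t);
  return std::exp(log_sum / times.size());
}
```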
Agenda
• Parquet
  • File format description
  • Benchmark Results in Impala
  • What's Next
• UDF/UDAs (Work in Progress)
Terminology
• UDF: Tuple -> Scalar
  user-defined function
  • E.g. Substring
• UDA/UDAF: {Tuple} -> Scalar
  user-defined aggregate function
  • E.g. Min
• UDTF: {Tuple} -> {Tuple}
  user-defined table function
Impala 1.2
• Support Hive UDFs (Java)
  • Existing Hive jars will run without a recompile.
• Add Impala (native) UDFs and UDAs.
  • New interface designed to execute as efficiently as possible for Impala.
  • Similar interface to Postgres UDFs/UDAs
• UDF/UDA registered for the Impala service in the metadata catalog
  • i.e. CREATE FUNCTION / CREATE AGGREGATE
Example UDF

// This UDF adds two ints and returns an int.
IntVal AddUdf(UdfContext* context,
              const IntVal& arg1,
              const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
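The NULL-propagation convention shown in the slide can be exercised outside Impala with stand-in type definitions. The UdfContext and IntVal below are hypothetical minimal stubs, not the real headers shipped with Impala:

```cpp
#include <cstdint>

// Hypothetical stand-ins for Impala's UDF types, just enough to run the
// example; the real definitions live in Impala's UDF development headers.
struct UdfContext {};

struct IntVal {
  bool is_null = false;
  int32_t val = 0;
  IntVal() {}
  explicit IntVal(int32_t v) : val(v) {}
  static IntVal null() { IntVal v; v.is_null = true; return v; }
};

// Same logic as the slide's example: NULL in => NULL out, otherwise add.
IntVal AddUdf(UdfContext* context, const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
```

Separating the null flag from the value is what lets the engine pass SQL NULLs through plain C++ structs without boxing.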
DDL

The CREATE statement will need to specify the UDF/UDA signature, the location of the binary and the symbol for the UDF function.

CREATE FUNCTION substring(string, int, int)
RETURNS string LOCATION "hdfs://path"
"com.me.Substring"

CREATE FUNCTION log(anytype) RETURNS anytype
LOCATION "hdfs://path2" "Log"
UDFs
• Support for variadic args
• Support for polymorphic types
UDAs
• UDA must implement the typical state machine:
  • Init()
  • Update()
  • Serialize()
  • Merge()
  • Finalize()
• Data movement handled by Impala
UDA Example

// This is a sample of implementing the COUNT aggregate function.

void Init(UdfContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}

void Update(UdfContext* context, const AnyVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}

void Merge(UdfContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal Finalize(UdfContext* context, const BigIntVal& val) {
  return val;
}
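Impala itself drives this state machine across the cluster: each node runs Init and Update over its local rows, and the coordinator Merges the partial results and Finalizes. A self-contained simulation of that flow (all type and function names here are hypothetical stubs standing in for Impala's UDA interface):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for Impala's types, just to run the state machine.
struct UdfContext {};
struct BigIntVal { bool is_null = true; int64_t val = 0; };
struct AnyVal { bool is_null; };

// Same COUNT logic as the slide's example.
void Init(UdfContext*, BigIntVal* val) { val->is_null = false; val->val = 0; }
void Update(UdfContext*, const AnyVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}
void Merge(UdfContext*, const BigIntVal& src, BigIntVal* dst) { dst->val += src.val; }
BigIntVal Finalize(UdfContext*, const BigIntVal& val) { return val; }

// Simulate COUNT over rows split across nodes: each "node" runs
// Init + Update locally, then the coordinator Merges and Finalizes.
int64_t DistributedCount(const std::vector<std::vector<AnyVal>>& nodes) {
  UdfContext ctx;
  BigIntVal total;
  Init(&ctx, &total);
  for (const auto& rows : nodes) {
    BigIntVal partial;
    Init(&ctx, &partial);
    for (const AnyVal& row : rows) Update(&ctx, row, &partial);
    Merge(&ctx, partial, &total);  // in Impala, Serialize() ships this over the wire
  }
  return Finalize(&ctx, total).val;
}
```

Because Merge only sees partial aggregates, the user code never has to know how many nodes ran or how the rows were split.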
Runtime Code-Generation
• Impala uses LLVM to generate code at runtime to run the query.
  • Takes into account constants that are only known after query analysis.
  • Greatly improves CPU efficiency
• Native UDFs/UDAs can benefit from this as well.
  • Instead of providing the UDF/UDA as a shared object, compile it (with Clang, using an additional flag) to LLVM IR
  • The IR will be integrated with the query execution.
  • No function call overhead for UDF/UDAs
Limitations
• Hive UDAs/UDTFs not supported
• No UDTFs in the native interface
• Can't run out of process
  • The native interface is designed to support this; UDFs will be able to run without a recompile
  • We're planning to address this in Impala 1.3
Thanks!
• We'd love your feedback on UDFs/UDAs
• Questions?
Performance Considerations for Cloudera Impala
Henry Robinson
henry@cloudera.com / @henryr
Impala Meetup 2013-08-20
Agenda
● The basics: Performance Checklist
● Review: How does Impala execute queries?
● What makes queries fast (or slow)?
● How can I debug my queries?
Impala Performance Checklist
● Verify – run a simple count(*) query on a relatively big table
and verify:
○ Data locality, block locality, and NO check-summing (“Testing Impala
Performance”)
○ Optimal IO throughput of HDFS scans (typically ~100 MB/s per disk)
● Stats – BOTH table and column stats, especially for:
○ Joining two large tables
○ Insert into as select through Impala
● Join table ordering – will be automatic in the Impala 2.0
wave. Until then:
○ Largest table first
○ Then most selective to least selective
● Monitor - monitor Impala queries to pinpoint slow
queries and drill into potential issues
○ CM 4.6 adds query monitoring
○ CM 5.0 will have the next big enhancements
Part 1: How does Impala
execute queries?
The basic idea
● Every Impala query runs across a cluster of
multiple nodes, with lots of available CPU
cores, memory and disk
● Best query speeds usually come when every
node in the cluster has something to do
● Impala solves two basic problems:
○ Figure out what every node should do (compilation)
○ Make them do it really quickly! (execution)
Query compilation
● a.k.a. ‘figuring out what every node should do’
● Impala compiles a SQL query into a plan describing
what to execute, and where
● A plan is shaped like a tree. Data flows up from the
leaves of the tree to the root.
● Each node in the tree is a query operator
● Impala chops this tree up into plan fragments
● Each node gets one or more plan fragments
Query execution
● Once started, each query operator can run
independently of any other operator
● Every operator can be doing something
at the same time
● This is the not-so-secret sauce for all
massively parallel query execution engines
Part 2: What makes
queries fast (or... slow)?
What determines performance?
● Data size
● Per-operator execution efficiency
● Available parallelism
● Available concurrency
● Hardware
● Schema design and file format
Data size
● More data means more work
● Not just the size of the disk-based data at plan leaves,
but size of internal data flowing in to any operator
● How can you help?
○ Partition your data
○ SELECT with LIMIT in subqueries
○ Push predicates down
○ Use correct JOIN order
■ Gather table statistics
○ Use the right file format
Table Ordering
● Tables are joined in the order listed in the FROM clause
● Impala uses left-deep trees for nested joins
● “Largest” table should be listed first
○ largest = returning most rows before join filtering
○ In a star schema, this is often the fact table
● Then list tables in order of most selective join filter to least selective
○ Filter the most rows as early as possible
Join Types
● Two types of join strategy are supported
○ Broadcast
○ Shuffle/Partitioned
● Broadcast
○ Each node receives a full copy of the right table
○ Per node memory usage = size of right table
● Shuffle
○ Both sides of the join are partitioned
○ Matching partitions sent to same node
○ Per node memory usage = 1/nodes x size of right table
● Without column statistics, all joins are broadcast
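The per-node memory rules above reduce to simple arithmetic; a back-of-envelope estimator (helper name hypothetical, not an Impala API):

```cpp
#include <cstdint>

// Rough per-node memory needed for the hash table built from the
// right-hand join input, per the broadcast vs. shuffle rules above.
int64_t JoinMemPerNode(int64_t right_table_bytes, int num_nodes, bool broadcast) {
  // Broadcast: every node holds a full copy of the right table.
  // Shuffle: the right table is hash-partitioned across all nodes.
  return broadcast ? right_table_bytes : right_table_bytes / num_nodes;
}
```

This is why column statistics matter: without a size estimate for the right table, the planner cannot tell when a broadcast copy would blow past per-node memory.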
Per-operator execution efficiency
● Impala is fast, and getting faster
● LLVM-based improvements
● More efficient disk scanners
● More modern algorithms from the DB
literature
● How can you help?
○ Upgrade to the latest version
Available parallelism
● Parallelism: number of resources available to use at
once
● More hardware means more parallelism
● Impala will take advantage of more cores, disks and
memory where possible
● Easiest (but most expensive!) way to improve
performance of large class of queries
● You can scale up incrementally
Available concurrency
● Concurrency: how well can a query take advantage of
available parallelism?
● Impala will take care of this mostly for you
● But some operators naturally don’t parallelise well in
certain conditions
● For example: joining two huge tables together.
○ The hash-node operators have to wait for one side to be read
completely before reading much of the other side
● How you can help:
○ Read the profiles, look for obvious bottlenecks, rephrase if possible
Hardware
● Designed for modern hardware
○ Leverages SSE 4.2 (Intel Nehalem or newer)
○ LLVM Compiler Infrastructure
○ Runtime Code Generation
○ In-memory execution pipelines
● Today’s hardware
○ 2 x Xeon E5 6 core CPUs
○ 12 x 3 TB HDD
○ 128 GB RAM
● How you can help:
○ Use the supported platforms, with Cloudera’s
packages
Schema design
● PARTITION BY is an easy win
● In general, string is slower than fixed-width
types (particularly for aggregations etc)
● File formats are crucial
○ Experiment with Parquet for performance
○ Avoid text
Supported File Formats
● Various HDFS file formats
○ Text File (read/write)
○ Avro (read)
○ SequenceFile (read)
○ RCFile (read)
○ ParquetFile (read/write)
● Various compression codecs
○ Snappy (ParquetFile, RCFile, SequenceFile, Avro)
○ LZO (Text)
○ Bzip (ParquetFile, RCFile, SequenceFile, Avro)
○ Gzip (ParquetFile, RCFile, SequenceFile, Avro)
● HBase also supported
Partitioning Considerations
● Single largest performance feature
○ Skips unnecessary data
○ Requires queries contain partition keys as filters
● Choose a reasonable number of partitions
○ Lots of small files becomes an issue
○ Metadata overhead on NameNode
○ Metadata overhead for Hive Metastore
○ Impala caches this, but first load may take long
Part 3: Debugging queries
The Debug Pages
● Every impalad exports a lot of useful
information on http://<impalad>:25000 (by
default), including:
○ Last 25 queries
○ Active sessions
○ Known tables
○ Last 1MB of the log
○ System metrics
○ Query profiles
● Information-dense - not for the faint of heart!
Thanks! Questions?
Try It Out!
● Apache-licensed open source
○ Impala 1.1 released 7/24/2013
○ Impala 1.0 GA released 4/30/2013
● Questions/comments?
○ Download: cloudera.com/impala
○ Email: impala-user@cloudera.org
○ Join: groups.cloudera.org
○ MeetUp: meetup.com/Bay-Area-Impala-Users-Group/
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Presentations from the Cloudera Impala meetup on Aug 20 2013

  • 1. Parquet Update/UDFs in Impala. Nong Li, Software Engineer, Cloudera
  • 2. Agenda: Parquet (file format description; benchmark results in Impala; Parquet 2.0), then UDF/UDAs
  • 8. Data Pages: Values are stored in data pages as a triple: definition level, repetition level, and value. These are stored contiguously on disk, so a column can be read with one seek regardless of nesting. Data pages are stored with different encodings: bit packing and run-length encoding (RLE); dictionary encoding for strings, extended to all types in Parquet 1.1; and plain (little-endian) encoding for native types.
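To illustrate the run-length encoding idea mentioned on this slide, a minimal sketch (this shows only the core value/run-length idea, not Parquet's actual RLE/bit-packing hybrid on-disk format; `RleEncode` is a hypothetical name):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Minimal run-length encoder: collapses consecutive repeats of a value
// into (value, run_length) pairs. Repetitive columns (e.g. low-cardinality
// definition levels) compress very well under this scheme.
std::vector<std::pair<int32_t, uint32_t>> RleEncode(const std::vector<int32_t>& values) {
    std::vector<std::pair<int32_t, uint32_t>> runs;
    for (int32_t v : values) {
        if (!runs.empty() && runs.back().first == v) {
            ++runs.back().second;    // extend the current run
        } else {
            runs.push_back({v, 1});  // start a new run
        }
    }
    return runs;
}
```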
  • 9. Parquet 2.0: Additional encodings: group varint (for small ints), an improved string storage format, and delta encoding (for strings and ints). Additional metadata: sorted files and page/column/file statistics. Expected to further reduce on-disk size and allow skipping values on the read path.
  • 10. Hardware Setup: 10 nodes, each with a 16-core Xeon, 48 GB RAM, and 12 disks; CDH 4.3; Impala 1.1
  • 11. TPC-H lineitem table @ 1 TB scale factor: on-disk size (GB, 0-800 scale) compared across Text, Text w/ LZO, Seq w/ Snappy, Avro w/ Snappy, RCFile w/ Snappy, Parquet w/ Snappy, and Seq w/ Gzip. [bar chart]
  • 12. Query Times on TPC-H lineitem table: times for scans of 1, 3, 5, and 16 (all) columns, 5 columns with 3 clients, TPC-H Q1 (7 columns), and bytes read for Q1 (GB), compared across Text, Seq w/ Snappy, Avro w/ Snappy, RCFile w/ Snappy, and Parquet w/ Snappy. [bar chart]
  • 13. Query Times on TPC-DS queries (Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, Q96), in seconds, for Text, Seq w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy. Average times (geometric mean): Text 224 seconds; Seq Snappy 257 seconds; RC Snappy 150 seconds; Parquet 61 seconds. [bar chart]
  • 14. Agenda: Parquet (file format description; benchmark results in Impala); What's next; UDF/UDAs (work in progress)
  • 15. Terminology: UDF (user-defined function): Tuple -> Scalar, e.g. substring. UDA/UDAF (user-defined aggregate function): {Tuple} -> Scalar, e.g. min. UDTF (user-defined table function): {Tuple} -> {Tuple}.
  • 16. Impala 1.2: Supports Hive UDFs (Java); existing Hive jars will run without a recompile. Adds Impala (native) UDFs and UDAs, with a new interface designed to execute as efficiently as possible in Impala, similar to the Postgres UDF/UDA interface. UDFs/UDAs are registered for the Impala service in the metadata catalog, i.e. CREATE FUNCTION / CREATE AGGREGATE.
  • 17. Example UDF:
// This UDF adds two ints and returns an int.
IntVal AddUdf(UdfContext* context,
              const IntVal& arg1,
              const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
  • 18. DDL: The CREATE statement specifies the UDF/UDA signature, the location of the binary, and the symbol for the UDF function.
CREATE FUNCTION substring(string, int, int) RETURNS string LOCATION "hdfs://path" "com.me.Substring"
CREATE FUNCTION log(anytype) RETURNS anytype LOCATION "hdfs://path2" "Log"
  • 19. UDFs: Support for variadic args; support for polymorphic types
  • 20. UDAs: A UDA must implement the typical state machine: Init(), Update(), Serialize(), Merge(), Finalize(). Data movement is handled by Impala.
  • 21. UDA Example:
// This is a sample of implementing the COUNT aggregate function.
void Init(UdfContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}

void Update(UdfContext* context, const AnyVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}

void Merge(UdfContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal Finalize(UdfContext* context, const BigIntVal& val) {
  return val;
}
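The state machine on this slide can be simulated as a standalone sketch (hypothetical harness, not the real Impala UDF headers): each node runs Init and Update over its own partition of rows, then the intermediate states are Merged and the result Finalized on one node.

```cpp
#include <cstdint>
#include <vector>

// Simulates how Impala drives a COUNT UDA: per-partition Init()+Update(),
// then Merge() of intermediate states, then Finalize(). Illustration only.
int64_t CountAcrossPartitions(const std::vector<std::vector<int64_t>>& partitions) {
    int64_t merged = 0;                // Init() of the coordinator state
    for (const auto& part : partitions) {
        int64_t local = 0;             // Init() on each node
        for (int64_t v : part) {
            (void)v;                   // COUNT ignores the value itself
            ++local;                   // Update() per input row
        }
        merged += local;               // Merge() intermediate states
    }
    return merged;                     // Finalize()
}
```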
  • 22. Runtime Code Generation: Impala uses LLVM to generate code for the query at runtime, taking into account constants that are only known after query analysis; this greatly improves CPU efficiency. Native UDFs/UDAs can benefit from this as well: instead of providing the UDF/UDA as a shared object, compile it (with Clang) with an additional flag to LLVM IR, and the IR will be integrated with the query execution. No function-call overhead for UDFs/UDAs.
  • 23. Limitations: Hive UDAs/UDTFs are not supported; no UDTFs in the native interface; can't run out of process (the native interface is designed to support this, and will be able to run without a recompile). We're planning to address this in Impala 1.3.
  • 24. Thanks! We'd love your feedback on UDFs/UDAs. Questions?
  • 26. Agenda ● The basics: Performance Checklist ● Review: How does Impala execute queries? ● What makes queries fast (or slow)? ● How can I debug my queries?
  • 27. Impala Performance Checklist ● Verify – run a simple count(*) query on a relatively big table and verify: ○ Data locality, block locality, and NO check-summing (see "Testing Impala Performance") ○ Optimal IO throughput of HDFS scans (typically ~100 MB/s per disk) ● Stats – BOTH table and column stats, especially for: ○ Joining two large tables ○ INSERT INTO ... AS SELECT through Impala ● Join table ordering – will be automatic in the Impala 2.0 wave. Until then: ○ Largest table first ○ Then most selective to least selective ● Monitor – monitor Impala queries to pinpoint slow queries and drill into potential issues ○ CM 4.6 adds query monitoring ○ CM 5.0 will have the next big enhancements
  • 28. Part 1: How does Impala execute queries?
  • 29. The basic idea ● Every Impala query runs across a cluster of multiple nodes, with lots of available CPU cores, memory and disk ● Best query speeds usually come when every node in the cluster has something to do ● Impala solves two basic problems: ○ Figure out what every node should do (compilation) ○ Make them do it really quickly! (execution)
  • 30. Query compilation ● a.k.a. ‘figuring out what every node should do’ ● Impala compiles a SQL query into a plan describing what to execute, and where ● A plan is shaped like a tree. Data flows up from the leaves of the tree to the root. ● Each node in the tree is a query operator ● Impala chops this tree up into plan fragments ● Each node gets one or more plan fragments
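The tree-shaped plan described here can be modeled with a toy structure (not Impala's actual planner classes; `PlanNode` and `CountOperators` are illustrative names): data flows from leaf scans up to the root, and a walk over the tree is the kind of traversal a planner does when chopping it into fragments.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy model of a query plan: a tree of operators where data flows
// from the leaves (scans) up to the root.
struct PlanNode {
    std::string op;  // e.g. "SCAN", "HASH JOIN", "AGG"
    std::vector<std::unique_ptr<PlanNode>> children;
};

// Counts operators in the tree, visiting every node once.
int CountOperators(const PlanNode& node) {
    int n = 1;
    for (const auto& child : node.children) n += CountOperators(*child);
    return n;
}
```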
  • 31. Query execution ● Once started, each query operator can run independently of any other operator ● Every operator can be doing something at the same time ● This is the not-so-secret sauce for all massively parallel query execution engines
  • 32. Part 2: What makes queries fast (or... slow)?
  • 33. What determines performance? ● Data size ● Per-operator execution efficiency ● Available parallelism ● Available concurrency ● Hardware ● Schema design and file format
  • 34. Data size ● More data means more work ● Not just the size of the disk-based data at plan leaves, but size of internal data flowing in to any operator ● How can you help? ○ Partition your data ○ SELECT with LIMIT in subqueries ○ Push predicates down ○ Use correct JOIN order ■ Gather table statistics ○ Use the right file format
  • 35. ● Tables are joined in the order listed in the FROM clause ● Impala uses left-deep trees for nested joins ● “Largest” table should be listed first ○ largest = returning most rows before join filtering ○ In a star schema, this is often the fact table ● Then list tables in order of most selective join filter to least selective ○ Filter the most rows as early as possible Table Ordering
  • 36. Join Types ● Two types of join strategy are supported ○ Broadcast ○ Shuffle/Partitioned ● Broadcast ○ Each node receives a full copy of the right table ○ Per node memory usage = size of right table ● Shuffle ○ Both sides of the join are partitioned ○ Matching partitions sent to same node ○ Per node memory usage = 1/nodes x size of right table ● Without column statistics, all joins are broadcast
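The per-node memory rules on this slide can be written out as a small numeric sketch (a hypothetical helper, not an Impala planner API):

```cpp
#include <cstdint>

// Per-node memory estimate for the right side of a join, following the
// slide's rules: broadcast replicates the full right table on every node;
// shuffle partitions it roughly evenly across nodes. Sizes in bytes.
int64_t PerNodeJoinMemory(int64_t right_table_bytes, int num_nodes, bool broadcast) {
    if (broadcast) return right_table_bytes;
    return right_table_bytes / num_nodes;  // shuffle: 1/nodes of the right table
}
```

For example, a 10 GB right table on a 10-node cluster costs 10 GB per node when broadcast but only about 1 GB per node when shuffled, which is why column statistics (which enable the shuffle choice) matter for large joins.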
  • 37. Per-operator execution efficiency ● Impala is fast, and getting faster ● LLVM-based improvements ● More efficient disk scanners ● More modern algorithms from the DB literature ● How can you help? ○ Upgrade to the latest version
  • 38. Available parallelism ● Parallelism: number of resources available to use at once ● More hardware means more parallelism ● Impala will take advantage of more cores, disks and memory where possible ● Easiest (but most expensive!) way to improve performance of large class of queries ● You can scale up incrementally
  • 39. Available concurrency ● Concurrency: how well can a query take advantage of available parallelism? ● Impala will take care of this mostly for you ● But some operators naturally don’t parallelise well in certain conditions ● For example: joining two huge tables together. ○ The hash-node operators have to wait for one side to be read completely before reading much of the other side ● How you can help: ○ Read the profiles, look for obvious bottlenecks, rephrase if possible
  • 40. Hardware ● Designed for modern hardware ○ Leverages SSE 4.2 (Intel Nehalem or newer) ○ LLVM Compiler Infrastructure ○ Runtime Code Generation ○ In-memory execution pipelines ● Today’s hardware ○ 2 x Xeon E5 6 core CPUs ○ 12 x 3 TB HDD ○ 128 GB RAM ● How you can help: ○ Use the supported platforms, with Cloudera’s packages
  • 41. Schema design ● PARTITION BY is an easy win ● In general, string is slower than fixed-width types (particularly for aggregations etc) ● File formats are crucial ○ Experiment with Parquet for performance ○ Avoid text
  • 42. Supported File Formats ● Various HDFS file formats ○ Text File (read/write) ○ Avro (read) ○ SequenceFile (read) ○ RCFile (read) ○ ParquetFile (read/write) ● Various compression codecs ○ Snappy (ParquetFile, RCFile, SequenceFile, Avro) ○ LZO (Text) ○ Bzip (ParquetFile, RCFile, SequenceFile, Avro) ○ Gzip (ParquetFile, RCFile, SequenceFile, Avro) ● HBase also supported
  • 43. Partitioning Considerations ● Single largest performance feature ○ Skips unnecessary data ○ Requires queries contain partition keys as filters ● Choose a reasonable number of partitions ○ Lots of small files becomes an issue ○ Metadata overhead on NameNode ○ Metadata overhead for Hive Metastore ○ Impala caches this, but first load may take long
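Partition pruning, the mechanism behind "skips unnecessary data" above, can be sketched as follows (illustrative only; Impala's planner does this from table metadata, and `PrunePartitions` is a hypothetical name):

```cpp
#include <string>
#include <utility>
#include <vector>

// Given partitions keyed by a partition-column value (e.g. a date string)
// and a query filter on that key, return only the partition paths that
// need to be scanned; all other partitions are skipped entirely.
std::vector<std::string> PrunePartitions(
        const std::vector<std::pair<std::string, std::string>>& partitions,  // {key, path}
        const std::string& filter_key) {
    std::vector<std::string> to_scan;
    for (const auto& p : partitions) {
        if (p.first == filter_key) to_scan.push_back(p.second);
    }
    return to_scan;
}
```

This is also why the query must contain the partition key as a filter: without it, every partition matches and nothing is pruned.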
  • 44. Part 3: Debugging queries
  • 45. The Debug Pages ● Every impalad exports a lot of useful information on http://<impalad>:25000 (by default), including: ○ Last 25 queries ○ Active sessions ○ Known tables ○ Last 1MB of the log ○ System metrics ○ Query profiles ● Information-dense - not for the faint of heart!
  • 46. Thanks! Questions? Try It Out! ● Apache-licensed open source ○ Impala 1.1 released 7/24/2013 ○ Impala 1.0 GA released 4/30/2013 ● Questions/comments? ○ Download: cloudera.com/impala ○ Email: impala-user@cloudera.org ○ Join: groups.cloudera.org ○ MeetUp: meetup.com/Bay-Area-Impala-Users-Group/