SlideShare a Scribd company logo
1 of 16
Download to read offline
1
Inges&ng	
  HDFS	
  data	
  into	
  Solr	
  using	
  
Spark	
  
Wolfgang	
  Hoschek	
  (whoschek@cloudera.com)	
  
So@ware	
  Engineer	
  @	
  Cloudera	
  Search	
  
Lucene/Solr	
  Meetup,	
  Jan	
  2015	
  
The	
  Enterprise	
  Data	
  Hub	
  
Unified	
  Scale-­‐out	
  Storage	
  
For	
  Any	
  Type	
  of	
  Data	
  
Elas&c,	
  Fault-­‐tolerant,	
  Self-­‐healing,	
  In-­‐memory	
  capabili&es	
  
Resource	
  Management	
  
Online	
  
NoSQL	
  	
  
DBMS	
  
Analy>c	
  	
  
MPP	
  DBMS	
  
Search	
  	
  
Engine	
  
Batch	
  	
  
Processing	
  
Stream	
  	
  
Processing	
  
Machine	
  	
  
Learning	
  
SQL	
   Streaming	
   File	
  System	
  (NFS)	
  
System	
  	
  
Management	
  
Data	
  	
  
Management	
  
Metadata,	
  Security,	
  Audit,	
  Lineage	
  
The image cannot be displayed.
Your computer may not have
2	
  
•  Mul&ple	
  processing	
  
frameworks	
  
•  One	
  pool	
  of	
  data	
  
•  One	
  set	
  of	
  system	
  
resources	
  
•  One	
  management	
  
interface	
  
•  One	
  security	
  
framework	
  
Apache	
  Spark	
  
•  Mission	
  
•  Fast	
  and	
  general	
  engine	
  for	
  large-­‐scale	
  data	
  processing	
  
•  Speed	
  
•  Advanced	
  DAG	
  execu&on	
  engine	
  that	
  supports	
  cyclic	
  data	
  
flow	
  and	
  in-­‐memory	
  compu&ng	
  
•  Ease	
  of	
  Use	
  
•  Write	
  applica&ons	
  quickly	
  in	
  Java,	
  Scala	
  or	
  Python	
  
•  Generality	
  
•  Combine	
  batch,	
  streaming,	
  and	
  complex	
  analy&cs	
  
•  Successor	
  to	
  MapReduce	
  
3	
  
Open	
  Source	
  
•  100%	
  Apache,	
  100%	
  Solr	
  
•  Standard	
  Solr	
  APIs	
  
What	
  is	
  Cloudera	
  Search?	
  
Interac>ve	
  search	
  for	
  Hadoop	
  
•  Full-­‐text	
  and	
  faceted	
  naviga&on	
  
•  Batch,	
  near	
  real-­‐&me,	
  and	
  on-­‐demand	
  indexing	
  
4
Apache	
  Solr	
  integrated	
  with	
  CDH	
  
•  Established,	
  mature	
  search	
  with	
  vibrant	
  community	
  
•  Incorporated	
  as	
  part	
  of	
  the	
  Hadoop	
  ecosystem	
  
•  Apache	
  Flume,	
  Apache	
  HBase	
  
•  Apache	
  MapReduce,	
  Kite	
  Morphlines	
  
•  Apache	
  Spark,	
  Apache	
  Crunch	
  
HDFS	
  
Online	
  Streaming	
  Data	
  
End	
  User	
  Client	
  App	
  
(e.g.	
  Hue)	
  
Flume	
  
Raw,	
  filtered,	
  or	
  
annotated	
  data	
  
SolrCloud	
  Cluster(s)	
  
NRT	
  Data	
  
indexed	
  w/	
  
Morphlines	
  
Indexed	
  data	
  
Spark	
  &	
  MapReduce	
  Batch	
  
Indexing	
  w/	
  Morphlines	
  
GoLive	
  updates	
  
HBase	
  
Cluster	
  
NRT	
  Replica&on	
  
Events	
  indexed	
  
w/	
  Morphlines	
  
OLTP	
  Data	
  
Cloudera	
  Manager	
  
Search	
  queries	
  
Cloudera	
  Search	
  Architecture	
  Overview	
  
5	
  
Customizable	
  Hue	
  UI	
  
•  Navigated,	
  faceted	
  drill	
  down	
  
•  Full	
  text	
  search,	
  standard	
  
Solr	
  API	
  and	
  query	
  language	
  
6	
   hdp://gethue.com	
  
Scalable	
  Batch	
  ETL	
  &	
  Indexing	
  
Index	
  
shard	
  
Files	
  
Index	
  
shard	
  
Indexer	
  w/	
  
Morphlines	
  
Files	
  or	
  
HBase	
  
tables	
  
Solr	
  
server	
  
Indexer	
  w/	
  
Morphlines	
  
Solr	
  
server	
  
7
HDFS	
  
Solr	
  and	
  MapReduce	
  
•  Flexible,	
  scalable,	
  reliable	
  
batch	
  indexing	
  
•  On-­‐demand	
  indexing,	
  cost-­‐
efficient	
  re-­‐indexing	
  
•  Start	
  serving	
  new	
  indices	
  
without	
  down&me	
  
•  “MapReduceIndexerTool”	
  
•  “HBaseMapReduceIndexerTool”	
  
•  “CrunchIndexerTool	
  on	
  MR”	
  
Solr	
  and	
  Spark	
  
•  “CrunchIndexerTool	
  	
  on	
  Spark”	
  
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
Streaming	
  ETL	
  (Extract,	
  Transform,	
  Load)	
  
Kite	
  Morphlines	
  
•  Consume	
  any	
  kind	
  of	
  data	
  from	
  any	
  
kind	
  of	
  data	
  source,	
  process	
  and	
  load	
  
into	
  Solr,	
  HDFS,	
  HBase	
  or	
  anything	
  else	
  
•  Simple	
  and	
  flexible	
  data	
  transforma&on	
  
•  Extensible	
  set	
  of	
  transf.	
  commands	
  
•  Reusable	
  across	
  mul&ple	
  workloads	
  
•  For	
  Batch	
  &	
  Near	
  Real	
  Time	
  
•  Configura&on	
  over	
  coding	
  	
  
•  reduces	
  &me	
  &	
  skills	
  
•  ASL	
  licensed	
  on	
  Github	
  
hdps://github.com/kite-­‐sdk/kite	
  
syslog	
   Flume	
  
Agent	
  
Solr	
  sink	
  
Command:	
  readLine	
  
Command:	
  grok	
  
Command:	
  loadSolr	
  
Solr	
  
Event	
  
Record	
  
Record	
  
Record	
  
Document	
  
8	
  
Morphline	
  Example	
  –	
  syslog	
  with	
  grok	
  
morphlines	
  :	
  [	
  
	
  {	
  
	
  	
  	
  id	
  :	
  morphline1	
  
	
  	
  	
  importCommands	
  :	
  [”org.kitesdk.**",	
  "org.apache.solr.**"]	
  
	
  	
  	
  commands	
  :	
  [	
  
	
  	
  	
  	
  	
  {	
  readLine	
  {}	
  }	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  {	
  	
  
	
  	
  	
  	
  	
  	
  	
  grok	
  {	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  dic&onaryFiles	
  :	
  [/tmp/grok-­‐dic&onaries]	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  expressions	
  :	
  {	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  message	
  :	
  """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_&mestamp}	
  %
{SYSLOGHOST:syslog_hostname}	
  %{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?:	
  %
{GREEDYDATA:syslog_message}"""	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  }	
  
	
  	
  	
  	
  	
  	
  	
  }	
  
	
  	
  	
  	
  	
  }	
  
	
  	
  	
  	
  	
  {	
  loadSolr	
  {}	
  }	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  ]	
  
	
  }	
  
]	
  
Example	
  Input	
  
<164>Feb	
  	
  4	
  10:46:14	
  syslog	
  sshd[607]:	
  listening	
  on	
  0.0.0.0	
  port	
  22	
  
Output	
  Record	
  
syslog_pri:164	
  
syslog_&mestamp:Feb	
  	
  4	
  10:46:14	
  
syslog_hostname:syslog	
  
syslog_program:sshd	
  
syslog_pid:607	
  
syslog_message:listening	
  on	
  0.0.0.0	
  port	
  22.	
  
	
  
	
  9
Example	
  Java	
  Driver	
  Program	
  -­‐	
  
Can	
  be	
  wrapped	
  into	
  Spark	
  func&ons	
  
/** Usage: java ... <morphline.conf> <dataFile1> ... <dataFileN> */
public static void main(String[] args) {
// compile morphline.conf file on the fly
File conf= new File(args[0]);
MorphlineContext ctx= new MorphlineContext.Builder().build();
Command morphline = new Compiler().compile(conf, null, ctx, null);
// process each input data file
Notifications.notifyBeginTransaction(morphline);
for (int i = 1; i < args.length; i++) {
InputStream in = new FileInputStream(new File(args[i]));
Record record = new Record();
record.put(Fields.ATTACHMENT_BODY, in);
morphline.process(record);
in.close();
}
Notifications.notifyCommitTransaction(morphline);
}
10
Scalable	
  Batch	
  Indexing	
  
11
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
S0_0_0
Extractors
(Mappers)
Leaf Shards
(Reducers)
Root Shards
(Mappers)
S0_0_1
S0S0_1_0
S0_1_1
S1_0_0
S1_0_1
S1S1_1_0
S1_1_1
Input
Files
...
...
...
...
•  Morphline	
  runs	
  inside	
  Mapper	
  
•  Reducers	
  build	
  local	
  Solr	
  indexes	
  
•  Mappers	
  merge	
  microshards	
  
•  GoLive	
  merges	
  into	
  live	
  SolrCloud	
  
GoLive	
  
GoLive	
  
•  Can	
  exploit	
  all	
  reducer	
  slots	
  even	
  if	
  	
  
#reducers	
  >>	
  #solrShards	
  
•  Great	
  throughput	
  but	
  poor	
  latency	
  
•  Only	
  inserts,	
  no	
  updates	
  &	
  deletes!	
  
•  Want	
  to	
  migrate	
  from	
  MR	
  to	
  Spark	
  
Batching	
  Indexing	
  with	
  CrunchIndexerTool	
  
12
spark-submit ... CrunchIndexerTool --morphline-file morphline.conf ...
or	
  
hadoop ... CrunchIndexerTool --morphline-file morphline.conf ...
•  Morphline	
  runs	
  inside	
  Spark	
  executors	
  
•  Morphline	
  sends	
  docs	
  to	
  live	
  SolrCloud	
  
•  Good	
  throughput	
  and	
  good	
  latency	
  
•  Supports	
  inserts,	
  updates	
  &	
  deletes	
  
•  Flag	
  to	
  run	
  on	
  Spark	
  or	
  MapReduce	
  
Extractors
(Executors/
Mappers)
SolrCloud
Shards
S0
S1
Input
Files
...
...
...
...
More	
  CrunchIndexerTool	
  features	
  (1/2)	
  
•  Implemented	
  with	
  Apache	
  Crunch	
  library	
  
•  Eases	
  migra&on	
  from	
  MapReduce	
  execu&on	
  engine	
  to	
  Spark	
  
execu&on	
  engine	
  –	
  can	
  run	
  on	
  either	
  engine	
  
•  Supported	
  Spark	
  modes	
  
•  Local	
  (for	
  tes&ng)	
  
•  YARN	
  client	
  
•  YARN	
  cluster	
  (for	
  produc&on)	
  
•  Efficient	
  batching	
  of	
  Solr	
  updates	
  and	
  deleteById	
  and	
  
deleteByQuery	
  
•  Efficient	
  locality-­‐aware	
  processing	
  for	
  splidable	
  HDFS	
  files	
  
•  avro,	
  parquet,	
  text	
  lines	
  
13
More	
  CrunchIndexerTool	
  features	
  (2/2)	
  
•  Dry-­‐run	
  mode	
  for	
  rapid	
  prototyping	
  
•  Sends	
  commit	
  to	
  Solr	
  on	
  job	
  success	
  
•  Inherits	
  Fault	
  tolerance	
  &	
  retry	
  from	
  Spark	
  (and	
  MR)	
  
•  Security	
  in	
  progress:	
  Kerberos	
  token	
  delega&on,	
  SSL	
  
•  ASL	
  licensed	
  on	
  Github	
  
•  hdps://github.com/cloudera/search/tree/
cdh5-­‐1.0.0_5.3.0/search-­‐crunch	
  
14
Conclusions	
  
•  Easy	
  migra&on	
  from	
  MapReduce	
  to	
  Spark	
  
•  Also	
  supports	
  updates	
  &	
  deletes	
  &	
  good	
  latency	
  
•  Recommenda&on	
  
•  Use	
  MapReduceIndexerTool	
  for	
  large	
  scale	
  batch	
  inges&on	
  
use	
  cases	
  where	
  updates	
  or	
  deletes	
  of	
  exis&ng	
  documents	
  
in	
  Solr	
  are	
  not	
  required	
  
•  Use	
  CrunchIndexerTool	
  for	
  all	
  other	
  use	
  cases	
  
•  Shipping	
  in	
  CDH5.2	
  and	
  above	
  
15
©2014 Cloudera, Inc. All rights
reserved.

More Related Content

What's hot

Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataGruter
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventGruter
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondGruter
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopGruter
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Hyunsik Choi
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Gruter
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 

What's hot (20)

Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
6.hive
6.hive6.hive
6.hive
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache Hive | Big Data Hadoop Spark Tutorial | CloudxLab
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 

Viewers also liked

Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchCloudera, Inc.
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
Powering the Internet of Things with Apache Hadoop
Powering the Internet of Things with Apache HadoopPowering the Internet of Things with Apache Hadoop
Powering the Internet of Things with Apache HadoopCloudera, Inc.
 
Welcome to the Age of Big Data in Banking
Welcome to the Age of Big Data in Banking Welcome to the Age of Big Data in Banking
Welcome to the Age of Big Data in Banking Andy Hirst
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Big Data in Banking (White paper)
Big Data in Banking (White paper)Big Data in Banking (White paper)
Big Data in Banking (White paper)InData Labs
 
Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry Capgemini
 
Big Data Analytics for Banking, a Point of View
Big Data Analytics for Banking, a Point of ViewBig Data Analytics for Banking, a Point of View
Big Data Analytics for Banking, a Point of ViewPietro Leo
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (17)

Hive Now Sparks
Hive Now SparksHive Now Sparks
Hive Now Sparks
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
Powering the Internet of Things with Apache Hadoop
Powering the Internet of Things with Apache HadoopPowering the Internet of Things with Apache Hadoop
Powering the Internet of Things with Apache Hadoop
 
Big Data Profiling
Big Data Profiling Big Data Profiling
Big Data Profiling
 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
 
Welcome to the Age of Big Data in Banking
Welcome to the Age of Big Data in Banking Welcome to the Age of Big Data in Banking
Welcome to the Age of Big Data in Banking
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Big Data in Banking (White paper)
Big Data in Banking (White paper)Big Data in Banking (White paper)
Big Data in Banking (White paper)
 
Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry Big Data Analytics in light of Financial Industry
Big Data Analytics in light of Financial Industry
 
Big Data Analytics for Banking, a Point of View
Big Data Analytics for Banking, a Point of ViewBig Data Analytics for Banking, a Point of View
Big Data Analytics for Banking, a Point of View
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar to Ingesting hdfs intosolrusingsparktrimmed

Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Open Source Logging and Metrics Tools
Open Source Logging and Metrics ToolsOpen Source Logging and Metrics Tools
Open Source Logging and Metrics ToolsPhase2
 
Open Source Logging and Monitoring Tools
Open Source Logging and Monitoring ToolsOpen Source Logging and Monitoring Tools
Open Source Logging and Monitoring ToolsPhase2
 
Vorontsov, golovko ssrf attacks and sockets. smorgasbord of vulnerabilities
Vorontsov, golovko   ssrf attacks and sockets. smorgasbord of vulnerabilitiesVorontsov, golovko   ssrf attacks and sockets. smorgasbord of vulnerabilities
Vorontsov, golovko ssrf attacks and sockets. smorgasbord of vulnerabilitiesDefconRussia
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformRedge Technologies
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 

Similar to Ingesting hdfs intosolrusingsparktrimmed (20)

Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Open Source Logging and Metrics Tools
Open Source Logging and Metrics ToolsOpen Source Logging and Metrics Tools
Open Source Logging and Metrics Tools
 
Open Source Logging and Monitoring Tools
Open Source Logging and Monitoring ToolsOpen Source Logging and Monitoring Tools
Open Source Logging and Monitoring Tools
 
Vorontsov, golovko ssrf attacks and sockets. smorgasbord of vulnerabilities
Vorontsov, golovko   ssrf attacks and sockets. smorgasbord of vulnerabilitiesVorontsov, golovko   ssrf attacks and sockets. smorgasbord of vulnerabilities
Vorontsov, golovko ssrf attacks and sockets. smorgasbord of vulnerabilities
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 

Recently uploaded

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 

Recently uploaded (20)

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 

Ingesting hdfs intosolrusingsparktrimmed

  • 1. 1 Inges&ng  HDFS  data  into  Solr  using   Spark   Wolfgang  Hoschek  (whoschek@cloudera.com)   So@ware  Engineer  @  Cloudera  Search   Lucene/Solr  Meetup,  Jan  2015  
  • 2. The  Enterprise  Data  Hub   Unified  Scale-­‐out  Storage   For  Any  Type  of  Data   Elas&c,  Fault-­‐tolerant,  Self-­‐healing,  In-­‐memory  capabili&es   Resource  Management   Online   NoSQL     DBMS   Analy>c     MPP  DBMS   Search     Engine   Batch     Processing   Stream     Processing   Machine     Learning   SQL   Streaming   File  System  (NFS)   System     Management   Data     Management   Metadata,  Security,  Audit,  Lineage   The image cannot be displayed. Your computer may not have 2   •  Mul&ple  processing   frameworks   •  One  pool  of  data   •  One  set  of  system   resources   •  One  management   interface   •  One  security   framework  
  • 3. Apache  Spark   •  Mission   •  Fast  and  general  engine  for  large-­‐scale  data  processing   •  Speed   •  Advanced  DAG  execu&on  engine  that  supports  cyclic  data   flow  and  in-­‐memory  compu&ng   •  Ease  of  Use   •  Write  applica&ons  quickly  in  Java,  Scala  or  Python   •  Generality   •  Combine  batch,  streaming,  and  complex  analy&cs   •  Successor  to  MapReduce   3  
  • 4. Open  Source   •  100%  Apache,  100%  Solr   •  Standard  Solr  APIs   What  is  Cloudera  Search?   Interac>ve  search  for  Hadoop   •  Full-­‐text  and  faceted  naviga&on   •  Batch,  near  real-­‐&me,  and  on-­‐demand  indexing   4 Apache  Solr  integrated  with  CDH   •  Established,  mature  search  with  vibrant  community   •  Incorporated  as  part  of  the  Hadoop  ecosystem   •  Apache  Flume,  Apache  HBase   •  Apache  MapReduce,  Kite  Morphlines   •  Apache  Spark,  Apache  Crunch  
  • 5. HDFS   Online  Streaming  Data   End  User  Client  App   (e.g.  Hue)   Flume   Raw,  filtered,  or   annotated  data   SolrCloud  Cluster(s)   NRT  Data   indexed  w/   Morphlines   Indexed  data   Spark  &  MapReduce  Batch   Indexing  w/  Morphlines   GoLive  updates   HBase   Cluster   NRT  Replica&on   Events  indexed   w/  Morphlines   OLTP  Data   Cloudera  Manager   Search  queries   Cloudera  Search  Architecture  Overview   5  
  • 6. Customizable  Hue  UI   •  Navigated,  faceted  drill  down   •  Full  text  search,  standard   Solr  API  and  query  language   6   hdp://gethue.com  
  • 7. Scalable  Batch  ETL  &  Indexing   Index   shard   Files   Index   shard   Indexer  w/   Morphlines   Files  or   HBase   tables   Solr   server   Indexer  w/   Morphlines   Solr   server   7 HDFS   Solr  and  MapReduce   •  Flexible,  scalable,  reliable   batch  indexing   •  On-­‐demand  indexing,  cost-­‐ efficient  re-­‐indexing   •  Start  serving  new  indices   without  down&me   •  “MapReduceIndexerTool”   •  “HBaseMapReduceIndexerTool”   •  “CrunchIndexerTool  on  MR”   Solr  and  Spark   •  “CrunchIndexerTool    on  Spark”   hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
  • 8. Streaming  ETL  (Extract,  Transform,  Load)   Kite  Morphlines   •  Consume  any  kind  of  data  from  any   kind  of  data  source,  process  and  load   into  Solr,  HDFS,  HBase  or  anything  else   •  Simple  and  flexible  data  transforma&on   •  Extensible  set  of  transf.  commands   •  Reusable  across  mul&ple  workloads   •  For  Batch  &  Near  Real  Time   •  Configura&on  over  coding     •  reduces  &me  &  skills   •  ASL  licensed  on  Github   hdps://github.com/kite-­‐sdk/kite   syslog   Flume   Agent   Solr  sink   Command:  readLine   Command:  grok   Command:  loadSolr   Solr   Event   Record   Record   Record   Document   8  
  • 9. Morphline  Example  –  syslog  with  grok   morphlines  :  [    {        id  :  morphline1        importCommands  :  [”org.kitesdk.**",  "org.apache.solr.**"]        commands  :  [            {  readLine  {}  }                                                    {                  grok  {                      dic&onaryFiles  :  [/tmp/grok-­‐dic&onaries]                                                                                  expressions  :  {                          message  :  """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_&mestamp}  % {SYSLOGHOST:syslog_hostname}  %{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?:  % {GREEDYDATA:syslog_message}"""                    }                }            }            {  loadSolr  {}  }                    ]    }   ]   Example  Input   <164>Feb    4  10:46:14  syslog  sshd[607]:  listening  on  0.0.0.0  port  22   Output  Record   syslog_pri:164   syslog_&mestamp:Feb    4  10:46:14   syslog_hostname:syslog   syslog_program:sshd   syslog_pid:607   syslog_message:listening  on  0.0.0.0  port  22.      9
  • 10. Example  Java  Driver  Program  -­‐   Can  be  wrapped  into  Spark  func&ons   /** Usage: java ... <morphline.conf> <dataFile1> ... <dataFileN> */ public static void main(String[] args) { // compile morphline.conf file on the fly File conf= new File(args[0]); MorphlineContext ctx= new MorphlineContext.Builder().build(); Command morphline = new Compiler().compile(conf, null, ctx, null); // process each input data file Notifications.notifyBeginTransaction(morphline); for (int i = 1; i < args.length; i++) { InputStream in = new FileInputStream(new File(args[i])); Record record = new Record(); record.put(Fields.ATTACHMENT_BODY, in); morphline.process(record); in.close(); } Notifications.notifyCommitTransaction(morphline); } 10
  • 11. Scalable  Batch  Indexing   11 hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ... S0_0_0 Extractors (Mappers) Leaf Shards (Reducers) Root Shards (Mappers) S0_0_1 S0S0_1_0 S0_1_1 S1_0_0 S1_0_1 S1S1_1_0 S1_1_1 Input Files ... ... ... ... •  Morphline  runs  inside  Mapper   •  Reducers  build  local  Solr  indexes   •  Mappers  merge  microshards   •  GoLive  merges  into  live  SolrCloud   GoLive   GoLive   •  Can  exploit  all  reducer  slots  even  if     #reducers  >>  #solrShards   •  Great  throughput  but  poor  latency   •  Only  inserts,  no  updates  &  deletes!   •  Want  to  migrate  from  MR  to  Spark  
  • 12. Batching  Indexing  with  CrunchIndexerTool   12 spark-submit ... CrunchIndexerTool --morphline-file morphline.conf ... or   hadoop ... CrunchIndexerTool --morphline-file morphline.conf ... •  Morphline  runs  inside  Spark  executors   •  Morphline  sends  docs  to  live  SolrCloud   •  Good  throughput  and  good  latency   •  Supports  inserts,  updates  &  deletes   •  Flag  to  run  on  Spark  or  MapReduce   Extractors (Executors/ Mappers) SolrCloud Shards S0 S1 Input Files ... ... ... ...
  • 13. More  CrunchIndexerTool  features  (1/2)   •  Implemented  with  Apache  Crunch  library   •  Eases  migra&on  from  MapReduce  execu&on  engine  to  Spark   execu&on  engine  –  can  run  on  either  engine   •  Supported  Spark  modes   •  Local  (for  tes&ng)   •  YARN  client   •  YARN  cluster  (for  produc&on)   •  Efficient  batching  of  Solr  updates  and  deleteById  and   deleteByQuery   •  Efficient  locality-­‐aware  processing  for  splidable  HDFS  files   •  avro,  parquet,  text  lines   13
  • 14. More  CrunchIndexerTool  features  (2/2)   •  Dry-­‐run  mode  for  rapid  prototyping   •  Sends  commit  to  Solr  on  job  success   •  Inherits  Fault  tolerance  &  retry  from  Spark  (and  MR)   •  Security  in  progress:  Kerberos  token  delega&on,  SSL   •  ASL  licensed  on  Github   •  hdps://github.com/cloudera/search/tree/ cdh5-­‐1.0.0_5.3.0/search-­‐crunch   14
  • 15. Conclusions   •  Easy  migra&on  from  MapReduce  to  Spark   •  Also  supports  updates  &  deletes  &  good  latency   •  Recommenda&on   •  Use  MapReduceIndexerTool  for  large  scale  batch  inges&on   use  cases  where  updates  or  deletes  of  exis&ng  documents   in  Solr  are  not  required   •  Use  CrunchIndexerTool  for  all  other  use  cases   •  Shipping  in  CDH5.2  and  above   15
  • 16. ©2014 Cloudera, Inc. All rights reserved.