Apache Solr on Hadoop is enabling organizations to collect, process, and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How were Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What problems are solved and what's still to come? Join us for an exciting discussion of this new technology.
1. Ingesting HDFS data into Solr using Spark
Wolfgang Hoschek (whoschek@cloudera.com)
Software Engineer @ Cloudera Search
Lucene/Solr Meetup, Jan 2015
2. The Enterprise Data Hub
[Diagram: unified scale-out storage for any type of data, with elastic, fault-tolerant, self-healing, in-memory capabilities; resource management; engines for Online NoSQL DBMS, Analytic MPP DBMS, Search Engine, Batch Processing, Stream Processing, Machine Learning, SQL, and Streaming; File System (NFS); system management; data management; metadata, security, audit, lineage]
• Multiple processing frameworks
• One pool of data
• One set of system resources
• One management interface
• One security framework
3. Apache Spark
• Mission: fast and general engine for large-scale data processing
• Speed: advanced DAG execution engine that supports cyclic data flow and in-memory computing
• Ease of use: write applications quickly in Java, Scala, or Python (see the sketch below)
• Generality: combine batch, streaming, and complex analytics
• Successor to MapReduce
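To make the "ease of use" bullet concrete, here is a minimal sketch of a Spark batch job in Java, in the spirit of the deck's later driver example; the class name and input path are hypothetical, not from the deck:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

/** Hypothetical minimal example: count ERROR lines in HDFS files with Spark's Java API. */
public class GrepCountSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("GrepCountSketch");
    JavaSparkContext sc = new JavaSparkContext(conf);
    long numErrors = sc.textFile(args[0])       // e.g. hdfs://.../logs
        .filter(line -> line.contains("ERROR")) // lazy transformation, runs in parallel
        .count();                               // action that triggers DAG execution
    System.out.println("error lines: " + numErrors);
    sc.stop();
  }
}

The filter transformation is only executed when the count() action runs; this lazy, whole-DAG scheduling is what the "advanced DAG execution engine" bullet above refers to.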
4. What is Cloudera Search?
Interactive search for Hadoop
Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing
Apache Solr integrated with CDH
• Established, mature search with a vibrant community
• Incorporated as part of the Hadoop ecosystem: Apache Flume, Apache HBase; Apache MapReduce, Kite Morphlines; Apache Spark, Apache Crunch
5. Cloudera Search Architecture Overview
[Architecture diagram: Flume feeds online streaming data into HDFS as raw, filtered, or annotated data, and indexes NRT data with Morphlines into SolrCloud cluster(s); Spark & MapReduce perform batch indexing with Morphlines over HDFS and push GoLive updates to SolrCloud; NRT replication events from an HBase cluster holding OLTP data are indexed with Morphlines; end-user client apps (e.g. Hue) issue search queries; Cloudera Manager manages the deployment]
6. Customizable Hue UI
• Navigated, faceted drill down
• Full text search, standard Solr API and query language
http://gethue.com
7. Scalable Batch ETL & Indexing
[Diagram: files or HBase tables in HDFS feed indexers with Morphlines, which build index shards on Solr servers]
Solr and MapReduce
• Flexible, scalable, reliable batch indexing
• On-demand indexing, cost-efficient re-indexing
• Start serving new indices without downtime
• "MapReduceIndexerTool"
• "HBaseMapReduceIndexerTool"
• "CrunchIndexerTool on MR"
Solr and Spark
• "CrunchIndexerTool on Spark"

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
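For reference, a fuller invocation might look like the following sketch; the jar name, HDFS paths, and ZooKeeper address are placeholders, and the flags beyond --morphline-file come from the tool's usage message rather than this deck:

# Placeholders: jar, paths, collection, and ZooKeeper ensemble are hypothetical.
hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1/tmp/outdir \
  --zk-host zk01.example.com:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs://nameservice1/user/systest/indir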
8. Streaming ETL (Extract, Transform, Load): Kite Morphlines
• Consume any kind of data from any kind of data source, process it, and load it into Solr, HDFS, HBase, or anything else
• Simple and flexible data transformation
• Extensible set of transformation commands
• Reusable across multiple workloads
• For batch & near real time
• Configuration over coding (reduces time & skills)
• ASL licensed on GitHub: https://github.com/kite-sdk/kite
[Pipeline diagram: syslog → Flume agent with Solr sink → readLine command → grok command → loadSolr command → Solr; an event becomes a record, is transformed record by record, and ends as a document]
9. Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
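In a real deployment, the bare loadSolr {} above is typically configured with a solrLocator naming the target SolrCloud collection; a minimal sketch, with a hypothetical collection name and ZooKeeper address:

{ loadSolr { solrLocator : { collection : collection1, zkHost : "zk01.example.com:2181/solr" } } }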
10. Example Java Driver Program - Can be wrapped into Spark functions

/** Usage: java ... <morphline.conf> <dataFile1> ... <dataFileN> */
public static void main(String[] args) throws Exception {
  // compile morphline.conf file on the fly
  File conf = new File(args[0]);
  MorphlineContext ctx = new MorphlineContext.Builder().build();
  Command morphline = new Compiler().compile(conf, null, ctx, null);

  // process each input data file
  Notifications.notifyBeginTransaction(morphline);
  for (int i = 1; i < args.length; i++) {
    InputStream in = new FileInputStream(new File(args[i]));
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, in);
    morphline.process(record);
    in.close();
  }
  Notifications.notifyCommitTransaction(morphline);
}
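The slide notes this driver can be wrapped into Spark functions. Below is a minimal sketch of what that wrapping might look like with Spark's Java API; it is not from the deck, and the class name, paths, and the choice to compile the morphline once per partition are assumptions (a compiled Command is stateful and not serializable, so it cannot be shipped from the driver):

import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

/** Hypothetical sketch: run a morphline over HDFS text lines inside Spark executors. */
public class SparkMorphlineSketch {
  public static void main(String[] args) {
    final String morphlineFile = args[0]; // ship to executors, e.g. via spark-submit --files
    String inputPath = args[1];           // e.g. hdfs://.../logs
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkMorphlineSketch"));
    sc.textFile(inputPath).foreachPartition(lines -> {
      // Compile the morphline once per partition; Command objects are not serializable.
      MorphlineContext ctx = new MorphlineContext.Builder().build();
      Command morphline = new Compiler().compile(new File(morphlineFile), null, ctx, null);
      Notifications.notifyBeginTransaction(morphline);
      while (lines.hasNext()) {
        Record record = new Record();
        // Feed each line as an attachment body, mirroring the driver program above.
        record.put(Fields.ATTACHMENT_BODY,
            new ByteArrayInputStream(lines.next().getBytes(StandardCharsets.UTF_8)));
        morphline.process(record);
      }
      Notifications.notifyCommitTransaction(morphline);
      Notifications.notifyShutdown(morphline);
    });
    sc.stop();
  }
}

With loadSolr in the morphline, each executor then sends documents directly to live SolrCloud, which is essentially what CrunchIndexerTool automates on the following slides.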
11. Scalable Batch Indexing

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

[Diagram: input files → extractors (mappers) → leaf shards (reducers: S0_0_0, S0_0_1, S0_1_0, S0_1_1, S1_0_0, S1_0_1, S1_1_0, S1_1_1) → root shards (mappers: S0, S1) → GoLive]
• Morphline runs inside Mapper
• Reducers build local Solr indexes
• Mappers merge microshards
• GoLive merges into live SolrCloud
• Can exploit all reducer slots even if #reducers >> #solrShards
• Great throughput but poor latency
• Only inserts, no updates & deletes!
• Want to migrate from MR to Spark
12. Batch Indexing with CrunchIndexerTool

spark-submit ... CrunchIndexerTool --morphline-file morphline.conf ...
or
hadoop ... CrunchIndexerTool --morphline-file morphline.conf ...

[Diagram: input files → extractors (executors/mappers) → SolrCloud shards S0, S1]
• Morphline runs inside Spark executors
• Morphline sends docs to live SolrCloud
• Good throughput and good latency
• Supports inserts, updates & deletes
• Flag to run on Spark or MapReduce
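A fuller spark-submit invocation might look roughly like the following sketch; the jar name, paths, and master setting are placeholders, and the --pipeline-type flag is an assumption based on the tool's documented options rather than this deck:

# Placeholders: jar, paths, and cluster settings are hypothetical.
spark-submit \
  --master yarn-cluster \
  --files morphline.conf \
  --class org.apache.solr.crunch.CrunchIndexerTool \
  search-crunch.jar \
  --morphline-file morphline.conf \
  --pipeline-type spark \
  hdfs://nameservice1/user/systest/input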
13. More CrunchIndexerTool features (1/2)
• Implemented with the Apache Crunch library
• Eases migration from the MapReduce execution engine to the Spark execution engine: can run on either engine
• Supported Spark modes
  • Local (for testing)
  • YARN client
  • YARN cluster (for production)
• Efficient batching of Solr updates and deleteById and deleteByQuery
• Efficient locality-aware processing for splittable HDFS files
  • avro, parquet, text lines
14. More CrunchIndexerTool features (2/2)
• Dry-run mode for rapid prototyping
• Sends commit to Solr on job success
• Inherits fault tolerance & retry from Spark (and MR)
• Security in progress: Kerberos token delegation, SSL
• ASL licensed on GitHub: https://github.com/cloudera/search/tree/cdh5-1.0.0_5.3.0/search-crunch
15. Conclusions
• Easy migration from MapReduce to Spark
• Also supports updates & deletes & good latency
• Recommendation
  • Use MapReduceIndexerTool for large-scale batch ingestion use cases where updates or deletes of existing documents in Solr are not required
  • Use CrunchIndexerTool for all other use cases
• Shipping in CDH 5.2 and above