This document discusses the rise of the Hadoop ecosystem. It outlines how the ecosystem has expanded from the original Hadoop components of HDFS for storage and MapReduce for distributed computation. New frameworks have emerged that allow for real-time queries, updates, and machine learning on big data. These include Spark, Storm, Drill, and streaming engines. The ecosystem is now a complex network of interoperable tools for storage, computation, analytics and machine learning on large datasets.
25. Impala: MPP database-like performance for Hadoop
- Created in 2012 by Cloudera
- Up to 100x faster than Hive (for certain queries)
26. Apache Drill: extensible architecture for SQL querying
• Started in 2013
• Apache Incubator project
• Lucidworks
• MapR
• ElasticSearch
• …
• Alpha status
• Open architecture for supporting SQL-like queries to various data sources:
• Cassandra
• MongoDB
• HDFS
• HBase
29. Update the model once per week using the whole history
Apply the model for each user using the most recent events
Real-Time
Navigation
Real-Time
Recommendation
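The split between a weekly batch update and a per-user real-time application can be sketched as follows. This is a minimal illustration, not the method used in the deck: the co-occurrence model and the names `train_model` and `recommend` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_model(history):
    """Batch step (e.g. run weekly): count item co-occurrences over the full history.

    `history` is a hypothetical mapping of user id -> list of items seen.
    """
    model = defaultdict(Counter)
    for items in history.values():
        # Every ordered pair of distinct items seen by the same user counts once.
        for a, b in permutations(set(items), 2):
            model[a][b] += 1
    return model


def recommend(model, recent_events, k=3):
    """Real-time step: score candidates using only the user's latest events."""
    scores = Counter()
    for item in recent_events:
        scores.update(model.get(item, {}))
    # Do not recommend what the user has just seen.
    for item in recent_events:
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(k)]


history = {"u1": ["a", "b"], "u2": ["a", "b", "c"], "u3": ["a", "b"]}
model = train_model(history)          # slow path, whole history
suggestions = recommend(model, ["a"])  # fast path, last events only
```

The expensive pass over the whole history happens rarely, while each request only needs a few dictionary lookups against the precomputed model.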
30. Storm: Reliable Distributed Real-Time Computations
- Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.)
- Runs computations in Java (native) or Python, Ruby, Perl …
- Guarantees that events are processed
- Distributes the workload
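The processing guarantee listed above is commonly achieved by acknowledging each event and replaying the ones that fail. The toy simulation below illustrates that idea only; it does not use Storm's API, and `ReplayingSource` and `run` are hypothetical names invented for the sketch.

```python
from collections import deque


class ReplayingSource:
    """Toy event source that re-emits any event that was not acknowledged."""

    def __init__(self, events):
        self.queue = deque(enumerate(events))  # (event_id, payload) pairs
        self.pending = {}                      # emitted but not yet acked

    def next_event(self):
        if not self.queue:
            return None
        event_id, payload = self.queue.popleft()
        self.pending[event_id] = payload
        return event_id, payload

    def ack(self, event_id):
        # Processing succeeded: forget the event.
        self.pending.pop(event_id, None)

    def fail(self, event_id):
        # Processing failed: put the event back at the end of the queue.
        self.queue.append((event_id, self.pending.pop(event_id)))


def run(source, handler):
    """Drain the source, replaying failed events until each one succeeds."""
    processed = []
    while (emitted := source.next_event()) is not None:
        event_id, payload = emitted
        try:
            processed.append(handler(payload))
            source.ack(event_id)
        except Exception:
            source.fail(event_id)
    return processed
```

With replay, an event may be handled more than once, so handlers should tolerate duplicates. A handler that always fails on some event would loop forever in this sketch; a real system would bound the number of retries.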
31. Summingbird: write MapReduce-like programs and execute them in either
• Batch
• Real-Time
• Hybrid Batch / Real-Time
• Open Sourced By Twitter in 2013
• Built on top of Storm (and Cascading)
• Program in Scala
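The idea of writing the logic once and running it in batch, real-time, or hybrid mode can be sketched with a word count whose partial results merge associatively. This is a plain-Python illustration under that assumption, not the Scala API of the library described on this slide; all function names are hypothetical.

```python
from collections import Counter


def word_counts(line):
    """Map step: turn one line into partial counts."""
    return Counter(line.split())


def merge(a, b):
    """Reduce step: associative merge of two partial results."""
    return a + b


def run_batch(lines):
    """Batch mode: fold the whole dataset at once."""
    total = Counter()
    for line in lines:
        total = merge(total, word_counts(line))
    return total


def run_streaming(state, line):
    """Real-time mode: fold a single incoming event into the running state."""
    return merge(state, word_counts(line))


def run_hybrid(batch_result, realtime_state):
    """Hybrid mode: combine the last batch result with the real-time delta."""
    return merge(batch_result, realtime_state)


old_lines = ["a b", "b c"]   # already covered by the batch job
new_lines = ["c c", "a"]     # arrived since the last batch run

batch = run_batch(old_lines)
state = Counter()
for line in new_lines:
    state = run_streaming(state, line)
combined = run_hybrid(batch, state)
```

Because `merge` is associative, the hybrid result equals what a single batch run over all the data would produce, which is what lets the same program definition serve all three modes.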