Hadoop ecosystem
• The Hadoop ecosystem contains components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper, and Apache Oozie.
HBase & HCatalog
Apache HBase:
• HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is scalable and distributed, and it provides real-time read/write access to data in HDFS.
HCatalog:
• HCatalog is a table and storage management layer for Hadoop. It lets different Hadoop ecosystem components, such as MapReduce, Hive, and Pig, easily read and write data on the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
• By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats.
Apache Mahout
• Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in HDFS, Mahout provides data science tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
• Clustering – takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
• Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations).
• Classification – learns from existing categorizations and then assigns unclassified items to the best category.
• Frequent pattern mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and identifies which items typically appear together.
Apache Sqoop & Flume
• Sqoop:
Imports data from external sources into Hadoop ecosystem components such as HDFS, HBase, or Hive.
It also exports data from Hadoop back to external sources.
Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
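• For illustration only, a minimal sketch of what such transfers typically look like on the command line; the host, database, table, and path names are hypothetical:

# Import a MySQL table into HDFS (hypothetical connection string and table)
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table orders --target-dir /user/etl/orders

# Export processed results from HDFS back to the relational database
sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table order_summary --export-dir /user/etl/order_summary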
Apache Flume
Flume efficiently collects, aggregate and moves a
large amount of data from its origin and sending it
back to HDFS.
It is fault tolerant and reliable mechanism.
This Hadoop Ecosystem component allows the
data flow from the source into Hadoop
environment.
It uses a simple extensible data model that allows
for the online analytic application.
Using Flume, we can get the data from multiple
servers immediately into hadoop.
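• As an illustration, a minimal (hypothetical) Flume agent configuration that tails a log file and delivers the events to HDFS might look like the sketch below; the agent name, log path, and HDFS path are placeholders:

# one exec source, one memory channel, one HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1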
Query languages for Hadoop
• MapReduce (MR) is the standard Big Data processing model for parallel and distributed processing of large datasets.
• The low-level, batch-oriented nature of MR makes it difficult to program, which has given rise to abstraction layers on top of MR.
• Several high-level MapReduce query languages built on top of MR provide more abstract query languages and extend the MR programming model.
• These high-level MapReduce query languages remove the burden of MR programming from developers and ease the migration of existing SQL skills to Big Data.
• Common high-level MapReduce query languages are built directly on top of MR and translate queries into executable native MR jobs.
• Four such high-level MapReduce query languages (JAQL, Hive, Big SQL, and Pig) can be compared with regard to their expressiveness and ease of programming.
Query languages for Hadoop
• Pig, from Yahoo! and now incubating at
Apache, has an imperative language called Pig
Latin for performing operations on large data
files.
• Jaql, from IBM, is a declarative query language for JSON data.
• Hive, from Facebook, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
HIVE & PIG
Hive:
• The Hadoop ecosystem component Apache Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• Hive performs three main functions: data summarization, query, and analysis.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
Pig:
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS.
• Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language.
• It is very similar to SQL.
• It loads the data, applies the required filters, and translates the data into the required format.
STREAM COMPUTING
• Big data stream computing is able to analyze and process data in real time to gain immediate insight; it is typically applied to analyzing vast amounts of data in real time and processing them at high speed.
• A stream computing system is a high-performance computer system that analyzes multiple data streams from many sources live.
• The word stream in stream computing means pulling in streams of data, processing the data, and streaming it back out as a single flow.
• Stream computing uses software algorithms that analyze the data in real time as it streams in, which increases speed and accuracy when handling and analyzing data.
• In a stream processing system, applications typically
act as continuous queries, ingesting data continuously,
analyzing and correlating the data, and generating a
stream of results.
• Applications are represented as data-flow graphs
composed of operators and interconnected by streams.
• The individual operators implement algorithms for data
analysis, such as parsing, filtering, feature extraction,
and classification.
• Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensor readings from sites in a forest, etc.).
Stream computing
• IBM announced its stream computing system, called System S.
• ATI Technologies also announced a stream computing technology that enables graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems.
PIG
• Pig raises the level of abstraction for processing large datasets.
• With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
• Pig Latin is a parallel data flow language: it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel (a short example follows the list below).
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There
are currently two environments: local execution in a single
JVM and distributed execution on a Hadoop cluster.
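• For example, a small Pig Latin data flow that reads a file, filters and groups it, and stores the result in parallel; the file names and fields are made up for illustration:

-- hypothetical input/output paths and schema
records  = LOAD 'input/temps.txt' AS (year:chararray, temp:int, quality:int);
good     = FILTER records BY temp != 9999 AND quality == 0;
grouped  = GROUP good BY year;
max_temp = FOREACH grouped GENERATE group, MAX(good.temp);
STORE max_temp INTO 'output/max_temp';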
Apache Pig Components
There are several components in the Apache Pig framework.
Parser
• All Pig scripts are first handled by the Parser. The Parser checks the syntax of the script, does type checking, and performs other miscellaneous checks. Its output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
The logical operators of the script are represented as the nodes and the data flows as the edges of the DAG (the logical plan).
Optimizer
• The logical plan (DAG) is then passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
Compiler
• The compiler then compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
• Finally, the MapReduce jobs are submitted to Hadoop in sorted order.
• Ultimately, this produces the desired results.
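• You can inspect what these stages produce for a given script with the EXPLAIN operator in Grunt; a small sketch with a hypothetical relation:

grunt> records = LOAD 'input/temps.txt' AS (year:chararray, temp:int);
grunt> grouped = GROUP records BY year;
grunt> EXPLAIN grouped;   -- prints the logical, physical, and MapReduce plans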
Important points
• A Pig Latin program is made up of a series of
operations, or transformations, that are applied to the
input data to produce output.
• Pig transforms your query into a series of MapReduce tasks, and you are unaware of this.
• You focus on the data rather than on the nature of the execution.
• Pig is a scripting language for exploring large datasets.
• Pig's sweet spot is its ability to process terabytes of data using only a half-dozen lines of Pig Latin from the console.
• Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by user-defined functions (UDFs); a short registration sketch follows below.
• As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.
• Pig isn’t suitable for all data processing tasks.
• If you want to perform a query that touches only a
small amount of data in a large dataset, then Pig will
not perform well, since it is set up to scan the whole
dataset, or at least large portions of it.
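• Returning to the extensibility point above, a minimal sketch of how a hypothetical Java UDF would be wired into a script, assuming it has been packaged as myudfs.jar with a class com.example.pig.ToUpper:

REGISTER myudfs.jar;                            -- make the jar available to Pig
DEFINE ToUpper com.example.pig.ToUpper();       -- give the UDF a short alias
names   = LOAD 'input/users.txt' AS (name:chararray);
shouted = FOREACH names GENERATE ToUpper(name);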
Execution Types
Pig has two execution types or modes:
• local mode and
• MapReduce mode.
Local mode:
• In local mode, Pig runs in a single JVM and accesses the local
filesystem. This mode is suitable only for small datasets and when
trying out Pig.
• The execution type is set using the -x or -exectype option. To run in
local mode, set the option to local.
• % pig -x local
• grunt>
• This starts Grunt, the Pig interactive shell.
MapReduce mode
• In MapReduce mode, Pig translates queries into
MapReduce jobs and runs them on a Hadoop
cluster.
• The cluster may be a pseudo or fully distributed
cluster.
• To use MapReduce mode, you first need to check
that the version of Pig you downloaded is
compatible with the version of Hadoop you are
using. Pig releases will only work against
particular versions of Hadoop.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script:
Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig. Alternatively, for very short scripts,
you can use the -e option to run a script specified as a string on the command line.
Grunt:
• Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to run
Pig scripts from within Grunt using run and exec.
Embedded:
• You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
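• A minimal sketch of the embedded approach using the PigServer class; the paths and aliases are hypothetical and error handling is omitted:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE
    pig.registerQuery("records = LOAD 'input/temps.txt' AS (year:chararray, temp:int);");
    pig.registerQuery("grouped = GROUP records BY year;");
    pig.store("grouped", "output/grouped");          // runs the pipeline and stores the result
  }
}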
• Interactive Mode (Grunt shell) − You can run
Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump
operator).
• Batch Mode (Script) − You can run Apache Pig in
Batch mode by writing the Pig Latin script in a
single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig also allows us to define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
• We can run Pig scripts in the shell after invoking the Grunt shell. Moreover, the Grunt shell offers a number of useful shell and utility commands.
Pig Latin
• Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs.
• A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
Statements are usually terminated with a semicolon, as in the GROUP statement above. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it (in Grunt, an unterminated statement is simply not executed until the terminating semicolon is entered).
• To analyze data in Hadoop using Apache Pig, we use the Pig Latin language.
• First, the Pig Latin statements are transformed into MapReduce jobs by an interpreter layer; Hadoop then processes these jobs.
• Pig Latin is a very simple language with SQL-like semantics.
• It can be used in a productive manner.
• It also contains a rich set of functions for data manipulation.
• Moreover, it can easily be extended by writing user-defined functions (UDFs) in Java, which makes it extensible in nature.
Data Model in Pig Latin
• The data model of Pig is fully nested. The outermost structure of the Pig Latin data model is a relation, which is a bag, where:
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
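• For illustration, this is how each construct looks when a relation is printed (the values are made up):
A tuple: (2019, 30)
A bag (a collection of tuples): {(2019,30),(2019,28)}
A field (a single piece of data): 30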
Statements in Pig Latin
• Statements are the basic constructs for processing data with Pig Latin.
• Statements work with relations and include expressions and schemas.
• Every statement ends with a semicolon (;).
• Through statements, we perform operations using the operators offered by Pig Latin.
• Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output.
• Semantic checking is carried out as soon as we enter a LOAD statement in the Grunt shell.
• To see the contents of the relation, we need to use the DUMP operator, because the MapReduce job that actually loads the data from the file system runs only when a DUMP (or STORE) operation is performed.
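• A small sketch of this in the Grunt shell; the HDFS path and schema are hypothetical:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
                 USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
grunt> DUMP student;      -- only now is a MapReduce job launched to read the data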
Pig Latin Data Types
• int – a signed 32-bit integer. Example: 10
• long – a signed 64-bit integer. Example: 10L
• float – a signed 32-bit floating point. Example: 10.5F
• double – a 64-bit floating point. Example: 10.5
• chararray – a character array (string) in Unicode UTF-8 format. Example: 'Data Flair'
• bytearray – a byte array (blob).
• boolean – a Boolean value. Example: true/false (case insensitive).
• datetime – a date-time. Example: 1970-01-01T00:00:00.000+00:00
• biginteger – a Java BigInteger. Example: 60708090709
• bigdecimal – a Java BigDecimal. Example: 185.98376256272893883
Complex Types
• Tuple
• Bag
• Map
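• A sketch of how the complex types are declared in a LOAD schema; the field names and input path are hypothetical:

data = LOAD 'input/complex.txt'
       AS (t: tuple(a:int, c:chararray),        -- a tuple of two fields
           bg: bag{row: tuple(x:int, y:int)},   -- a bag of tuples
           m: map[]);                           -- a map of key#value pairs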
Pig Latin Operators
Arithmetic Operators
Comparison Operators
Type Construction Operators
Data Processing Operators
Loading and Storing
• LOAD
It loads the data from a file system into a relation.
• STORE
It stores a relation to the file system (local/HDFS).
Filtering
• FILTER
Removes unwanted rows from a relation.
• DISTINCT
Removes duplicate rows from a relation.
• FOREACH, GENERATE
Transforms the data based on columns of data.
• STREAM
Transforms a relation using an external program.
Diagnostic Operators
• DUMP
Prints the contents of a relation on the console.
• DESCRIBE
Describes the schema of a relation.
• EXPLAIN
Displays the logical, physical, and MapReduce execution plans used to evaluate a relation.
• ILLUSTRATE
Displays the step-by-step execution of a series of statements.
Grouping and Joining
• JOIN – We can join two or more relations.
• COGROUP – Groups the data into two or more relations.
• GROUP – Groups the data in a single relation.
• CROSS – We can create the cross product of two or more relations.
Sorting
• ORDER
Arranges a relation in order based on one or more fields.
• LIMIT
We can get a particular number of tuples from a relation.
Combining and Splitting
• UNION
We can combine two or more relations into one relation.
• SPLIT
Splits a single relation into two or more relations.
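• A short sketch that chains several of these operators; the relations, fields, and paths are hypothetical:

orders  = LOAD 'input/orders.csv' USING PigStorage(',')
          AS (id:int, cust:chararray, amount:double);
big     = FILTER orders BY amount > 100.0;
by_cust = GROUP big BY cust;
totals  = FOREACH by_cust GENERATE group AS cust, SUM(big.amount) AS total;
sorted  = ORDER totals BY total DESC;
top10   = LIMIT sorted 10;
STORE top10 INTO 'output/top_customers';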
Hive
• Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files.
• It processes structured and semi-structured data in Hadoop.
• Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster.
• Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
• Metadata such as table schemas is stored in a database called the metastore.
Metastore
• It stores metadata for each and every table.
• Hive also includes the partition metadata.
• This helps the driver to track the progress of the various data sets distributed over the cluster.
• It stores the data in a traditional RDBMS format.
• A backup server regularly replicates the data, which can be retrieved in case of data loss.
Driver
• It acts like a controller which receives the HiveQL
statements.
• The driver starts the execution of the statement
by creating sessions.
• It monitors the life cycle and progress of the
execution.
• The driver stores the necessary metadata generated during the execution of a HiveQL statement.
• It also acts as a collection point for the data or query results obtained after the Reduce operation.
Compiler
• It performs the compilation of the HiveQL query.
• This converts the query to an execution plan. The
plan contains the tasks.
• The plan also contains the steps needed by MapReduce to produce the output, as translated from the query.
• The compiler in Hive converts the query to an Abstract Syntax Tree (AST).
• It first checks for compatibility and compile-time errors, then converts the AST to a Directed Acyclic Graph (DAG).
• Optimizer – It performs various
transformations on the execution plan to
provide optimized DAG. It aggregates the
transformations together, such as converting a
pipeline of joins to a single join, for better
performance. The optimizer can also split the
tasks, such as applying a transformation on
data before a reduce operation, to provide
better performance.
• Executor – Once compilation and optimization
complete, the executor executes the tasks.
Executor takes care of pipelining the tasks.
• CLI, UI, and Thrift Server –
• CLI (command-line interface) provides a user
interface for an external user to interact with
Hive.
• Thrift server in Hive allows external clients to
interact with Hive over a network, similar to
the JDBC or ODBC protocols.
Hive Shell
• The shell is the primary way that we will interact with Hive, by
issuing commands in HiveQL.
• HiveQL is Hive's query language, a dialect of SQL. It is heavily influenced by MySQL.
• When starting Hive for the first time, we can check that it is working by listing its tables. The command must be terminated with a semicolon to tell Hive to execute it.
• Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here. The Tab key will autocomplete Hive keywords and functions.
• For a fresh install, the command takes a few seconds to run, since it is lazily creating the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to where you ran the hive command from.)
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
• Hive Client
• Hive Services
• Processing and Resource Management
• Distributed Storage
Hive Client
• Hive supports applications written in languages like Python, Java, C++, and Ruby, using JDBC, ODBC, and Thrift drivers to perform queries against Hive. Hence, one can easily write a Hive client application in the language of one's choice.
• Hive clients are categorized into three types:
1. Thrift Clients
• The Hive server is based on Apache Thrift, so it can serve requests from Thrift clients.
2. JDBC client
• Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive server.
3. ODBC client
• The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
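• As an illustration of the JDBC route, a minimal sketch of a client, assuming HiveServer2 is reachable at a hypothetical host and port and the Hive JDBC driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");   // jdbc:hive2:// URL scheme
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hiveserver-host:10000/default", "user", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}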
Hive Services
• To perform queries, Hive provides various services such as HiveServer2, Beeline, etc. The main services offered by Hive are:
cli
• The command-line interface to Hive (the shell). This is the default service.
hiveserver2
• HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive. It allows multiple clients to submit requests to Hive and retrieve the final results. It is designed to provide the best support for open API clients like JDBC and ODBC.
hwi
• The Hive Web Interface.
jar
• The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
Meta Store
• The metastore is a central repository that stores metadata about the structure of tables and partitions, including column and column type information.
• It also stores information about the serializers and deserializers required for read/write operations, and the HDFS files where the data is stored. This metastore is generally a relational database.
• The metastore provides a Thrift interface for querying and manipulating Hive metadata.
We can configure the metastore in either of two modes:
• Remote: In remote mode, metastore is a Thrift service and is useful
for non-Java applications.
• Embedded: In embedded mode, the client can directly interact with
the metastore using JDBC.
Embedded
• The metastore is the central repository of Hive metadata.
The metastore is divided into two pieces: a service and the
backing store for the data. By default, the metastore service
runs in the same JVM as the Hive service and contains an
embedded database instance backed by the local disk. This
is called the embedded metastore configuration
• Using an embedded metastore is a simple way to get
started with Hive; however, only one embedded Derby
database can access the database files on disk at any one
time, which means you can only have one Hive session
open at a time that shares the same metastore. Trying to
start a second session gives the error:
• Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.
• The solution to supporting multiple sessions (and
therefore multiple users) is to use a standalone
database. This configuration is referred to as a
local metastore, since the metastore service still
runs in the same process as the Hive service, but
connects to a database running in a separate
process, either on the same machine or on a
remote machine.
• There’s another metastore configuration called a
remote metastore, where one or more metastore
servers run in separate processes to the Hive
service. This brings better manageability and
security, since the database tier can be
completely firewalled off, and the clients no
longer need the database credentials.
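• As an illustration, the metastore mode is selected with a few hive-site.xml properties; a hedged sketch in which the host names and database URL are placeholders:

<!-- local metastore: Hive talks JDBC directly to a standalone database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore_db</value>
</property>

<!-- remote metastore: clients talk Thrift to one or more metastore servers -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>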
SQL vs HiveQL
HiveQL
• The Hive Query Language (HiveQL) is a query
language for Hive to process and analyze
structured data in a Metastore.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Creating a database:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
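For example, creating a database and running a query that follows the template above (all names are hypothetical):

CREATE DATABASE IF NOT EXISTS company;
USE company;
SELECT dept, COUNT(*) AS emp_count
FROM employees
WHERE salary > 50000
GROUP BY dept
LIMIT 10;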
Data Types in Hive
Complex Data Types
Operators
• Relational Operators
• Arithmetic Operators
• Logical Operators
• String Operators
• Operators on Complex Types
Hive DDL commands
• Hive DDL commands are the statements used for defining
and changing the structure of a table or database in Hive. It
is used to build or modify the tables and other objects in
the database.
• The several types of Hive DDL commands are:
• CREATE
• SHOW
• DESCRIBE
• USE
• DROP
• ALTER
• TRUNCATE
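• A short sketch of these DDL commands in use; all names are hypothetical:

CREATE DATABASE IF NOT EXISTS sales;
USE sales;
CREATE TABLE orders (id INT, cust STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE orders;
ALTER TABLE orders RENAME TO orders_2020;
TRUNCATE TABLE orders_2020;
DROP TABLE orders_2020;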
Hive DML Commands
• Hive DML (Data Manipulation Language) commands are
used to insert, update, retrieve, and delete data from the
Hive table once the table and database schema has been
defined using Hive DDL commands.
• The various Hive DML commands are:
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
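• And a sketch of the common DML commands; the paths and names are hypothetical, and UPDATE and DELETE additionally require the table to be transactional (ACID-enabled):

LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE orders;
INSERT INTO TABLE orders VALUES (101, 'alice', 250.0);
SELECT * FROM orders WHERE amount > 100;
-- UPDATE orders SET amount = 300.0 WHERE id = 101;   (ACID tables only)
-- DELETE FROM orders WHERE id = 101;                 (ACID tables only)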
Joins
• Inner join in Hive
• Left Outer Join in Hive
• Right Outer Join in Hive
• Full Outer Join in Hive
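• A sketch of the join syntax; the tables and columns are made up:

SELECT c.id, c.name, o.amount
FROM customers c
JOIN orders o ON (c.id = o.cust_id);            -- inner join

SELECT c.id, c.name, o.amount
FROM customers c
LEFT OUTER JOIN orders o ON (c.id = o.cust_id); -- keep all customers

(RIGHT OUTER JOIN and FULL OUTER JOIN follow the same pattern.)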
Partition
• Apache Hive organizes tables into partitions. Partitioning is
a way of dividing a table into related parts based on the
values of particular columns like date, city, and department.
• Each table in the hive can have one or more partition keys
to identify a particular partition. Using partition it is easy to
do queries on slices of the data.
• There are two types of partitioning in Apache Hive:
Static partitioning
Dynamic partitioning
• To decompose table data sets into even more manageable parts, Apache Hive also provides bucketing; there is much more to learn about bucketing in Hive.
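• A sketch of a partitioned (and bucketed) table with hypothetical names; depending on your configuration, dynamic partitioning may also require enabling hive.exec.dynamic.partition and the nonstrict mode shown below:

CREATE TABLE logs (msg STRING, level STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (level) INTO 4 BUCKETS;

-- static partitioning: the partition value is given explicitly
INSERT INTO TABLE logs PARTITION (dt='2020-01-01')
SELECT msg, level FROM staging_logs WHERE dt_col = '2020-01-01';

-- dynamic partitioning: the partition value comes from the data
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE logs PARTITION (dt)
SELECT msg, level, dt_col AS dt FROM staging_logs;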
HBase
HBasics
• HBase is a distributed, column-oriented database built on top of HDFS.
• HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
Why HBase:
• RDBMSs get exponentially slower as the data becomes large.
• They expect data to be highly structured, i.e. able to fit in a well-defined schema.
• Any change in schema might require downtime.
HBase concepts
There are three types of servers in the master-slave HBase architecture:
HBase HMaster
Region Server
ZooKeeper
• Region servers serve data for read and write purposes. That means clients can communicate directly with HBase region servers while accessing data.
• The HBase Master process handles region assignment as well as DDL (create, delete table) operations.
• Finally, ZooKeeper maintains the live cluster state.
HMaster Server
The master server:
• Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as the creation of tables and column families.
Regions
• Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of a region by following the region size thresholds.
ZooKeeper
• ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, and providing distributed synchronization.
• ZooKeeper has nodes representing the different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients locate region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
Regions
• Tables are automatically partitioned horizontally by
HBase into regions. Each region comprises a subset
of a table’s rows.
• A region is denoted by the table it belongs to.
• Initially, a table comprises a single region, but as the
size increases, after it crosses a configurable size
threshold, it splits at a row boundary into two new
regions of approximately equal size.
• Until this first split happens, all loading will be
against the single server hosting the original region.
• As the table grows, the number of its regions grows. Regions are the units that get distributed over an HBase cluster.
• In this way, a table that is too big for any one
server can be carried by a cluster of servers
with each node hosting a subset of the table’s
total regions.
• Load on a table gets distributed.
• The online set of sorted regions comprises the
table’s total content.
• To maintain server state in the HBase Cluster,
HBase uses ZooKeeper as a distributed
coordination service.
• Basically, which servers are alive and available
is maintained by Zookeeper, and also it
provides server failure notification.
• Moreover, ZooKeeper maintains a guaranteed common shared state.
Clients
There are a number of client options for interacting with an HBase cluster.
• Java
• HBase, like Hadoop, is written in Java.
• MapReduce: HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs.
• The TableInputFormat class makes splits on region boundaries.
• The HBase TableOutputFormat writes the result of the reduce phase into HBase.
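• A minimal sketch of the Java client using the org.apache.hadoop.hbase.client API; the table, column family, and values are hypothetical and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // write one cell: row "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // read it back
      Result r = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}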
• HBase also ships with Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.
• In all cases, a Java server hosts an instance of the HBase client, brokering the application's Avro, REST, and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
HBase vs RDBMS
Database type
• HBase is a column-oriented database: each column is a contiguous unit of page.
• An RDBMS is row-oriented: each row is a contiguous unit of page.
Schema type
• The schema of HBase is less restrictive; adding columns on the fly is possible.
• The schema of an RDBMS is more restrictive.
Sparse tables
• HBase is good with sparse tables.
• An RDBMS is not optimized for sparse tables.
Scale up / scale out
• HBase supports scale-out: when we need more memory, processing power, or disk, we add new servers to the cluster rather than upgrading the present ones.
• An RDBMS supports scale-up: when we need more memory, processing power, or disk, we upgrade the same server to a more powerful one rather than adding new servers.
Amount of data
• In HBase, the amount of data depends not on a particular machine but on the number of machines.
• In an RDBMS, the amount of data depends on the configuration of the server.
ACID support
• HBase has no built-in ACID support.
• An RDBMS has ACID support.
Data type
• HBase supports both structured and unstructured data.
• An RDBMS is suited for structured data.
Transaction integrity
• In HBase, there is no transaction guarantee.
• An RDBMS mostly guarantees transaction integrity.
JOINs
• HBase does not support JOINs; they have to be implemented in the application layer (for example with MapReduce or Hive).
• An RDBMS supports JOINs.
Referential integrity
• When it comes to referential integrity, HBase has no built-in support.
• An RDBMS supports referential integrity.
Big SQL
• IBM Big SQL is a high-performance, massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data across the organization easy and secure.
• A Big SQL query can quickly access a variety of
data sources including HDFS, RDBMS, NoSQL
databases, object stores, and WebHDFS by
using a single database connection or single
query for best-in-class analytic capabilities.
• With Big SQL, your organization can derive
significant value from your enterprise data.
• Big SQL provides tools to help you manage
your system and your databases, and you can
use popular analytic tools to visualize your
data.
• Big SQL includes several tools and interfaces
that are largely comparable to tools and
interfaces that exist with most relational
database management systems.
• Big SQL's robust engine executes complex
queries for relational data and Hadoop data.
• Big SQL provides an advanced SQL compiler
and a cost-based optimizer for efficient query
execution.
• Combining these with a massive parallel
processing (MPP) engine helps distribute
query execution across nodes in a cluster.
• The Big SQL architecture uses the latest
relational database technology from IBM.
• The database infrastructure provides a logical
view of the data (by allowing storage and
management of metadata) and a view of the
query compilation, plus the optimization and
runtime environment for optimal SQL
processing.
• Applications connect to a specific node based on the user configuration.
• SQL statements are routed through this node to the Big SQL management node, the coordinating node.
• There can be one or many management nodes, but there is only one Big SQL management node. SQL statements are compiled and optimized to generate a parallel execution query plan.
• A runtime engine then distributes the tasks (queries) to worker processes on the compute nodes and manages the consumption and return of the result set.
• A compute node can be a physical server or an operating system instance.
• The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes.
• The DataNode holds the data.
• When a worker node receives a query, it dispatches special processes that know how to read and write HDFS data natively.
• Big SQL uses native and Java open-source readers (and writers) that are able to ingest different file formats.
• The Big SQL engine pushes predicates down to
these processes so that they can, in turn,
apply projection and selection closer to the
data. These processes also transform input
data into an appropriate format for
consumption inside Big SQL.
• All of these components can be placed on one management node, or each part can run on a separate management node.
• We can separate the Big SQL management
node from the other Hadoop master nodes.
• This arrangement can allow the Big SQL
management node to have enough resources
to store intermediate data from the Big SQL
data nodes.
More Related Content

Similar to BDA R20 21NM - Summary Big Data Analytics

A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsKrishnaVeni451953
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxiaeronlineexm
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptxVIJAYAPRABAP
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxraghavanand36
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction葵慶 李
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxinfinix8
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to HadoopAnandMHadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similar to BDA R20 21NM - Summary Big Data Analytics (20)

A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analytics
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Hadoop
HadoopHadoop
Hadoop
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptx
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 

Recently uploaded (20)

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 

BDA R20 21NM - Summary Big Data Analytics

  • 1. Hadoop ecosystem • Hadoop ecosystem contains components • like HDFS and HDFS components, • MapReduce, • YARN, Hive, Apache Pig, Apache HBase and HBase components, HCatalog, Avro, Thrift, Drill, Apache mahout, Sqoop, Apache Flume, Ambari, Zookeeper and Apache OOzie
  • 2.
  • 3. Hbase& Hcatalog Apache Hbase: • This is a Hadoop ecosystem component which is a distributed database that was designed to store structured data in tables that could have billions of row and millions of columns. HBase is scalable, distributed. HBase, provide real-time access to read or write data in HDFS. HCATALOG: • It is a table and storage management layer for Hadoop. HCatalog supports different components available in Hadoop ecosystems like MapReduce, Hive, and Pig to easily read and write data from the cluster. HCatalog is a key component of Hive that enables the user to store their data in any format and structure. • By default, HCatalog supports RCFile, CSV, JSON, sequenceFile and ORC file formats.
  • 4. Apache Mahout • Mahout is open source framework for creating scalable machine learning algorithm and data mining library. Once data is stored in Hadoop HDFS, mahout provides the data science tools to automatically find meaningful patterns in those big data sets. Algorithms of Mahout are: • Clustering – Here it takes the item in particular class and organizes them into naturally occurring groups, such that item belonging to the same group are similar to each other. • Collaborative filtering – It mines user behavior and makes product recommendations (e.g. Amazon recommendations) • Classifications – It learns from existing categorization and then assigns unclassified items to the best category. • Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or terms in query session) and then identifies which two items typically appear together.
  • 5. Apache Sqoop & Flume • Sqoop: Imports data from external sources into related Hadoop ecosystem components like HDFS, Hbase or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as teradata, Netezza, oracle, MySQL.
  • 6. Apache Flume Flume efficiently collects, aggregate and moves a large amount of data from its origin and sending it back to HDFS. It is fault tolerant and reliable mechanism. This Hadoop Ecosystem component allows the data flow from the source into Hadoop environment. It uses a simple extensible data model that allows for the online analytic application. Using Flume, we can get the data from multiple servers immediately into hadoop.
  • 7. Query languages for hadoop • MapReduce (MR) is a criterion of Big Data processing model with parallel and distributed large datasets. • This model knows difficult problems related to low-level and batch nature of MR that gives rise to an abstraction layer on the top of MR. • Several High-Level MapReduce Query Languages built on the top of MR provide more abstract query languages and extend the MR programming model
  • 8. • These High-Level MapReduce Query Languages remove the burden of MR programming away from the developers and make a soft migration of existing competences with SQL skills to Big Data. • Common High-Level MapReduce Query Languages built directly on the top of MR that translate queries into executable native MR jobs. • It evaluates the performance of the four presented High-Level MapReduce Query Languages: JAQL, Hive, Big SQL and Pig, with regards to their insightful perspectives and ease of programming.
  • 9.
  • 10. Query languages for hadoop • Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files. • Jaql, from IBM is a declarative query language for JSON data. • Hive, from Facebook is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
  • 11. HIVE & PIG Hive: • The Hadoop ecosystem component, Apache Hive, is an open source data warehouse system for querying and analysing large datasets stored in Hadoop files. • Hive do three main functions: data summarization, query, and analysis. • Hive use language called HiveQL (HQL), which is similar to SQL. • HiveQL automatically translates SQL-like queries into MapReduce jobs which will execute on Hadoop.
  • 12. Pig: • Apache Pig is a high-level language platform for analyzing and querying huge dataset that are stored in HDFS. • Pig as a component of Hadoop Ecosystem uses PigLatin language. • It is very similar to SQL. • It loads the data, applies the required filters and translate the data in the required format.
  • 13. STREAM COMPUTING • Big data stream computing is able to analyze and process data in real time to gain an immediate insight, and it is typically applied to the analysis of vast amount of data in real time and to process them at a high speed. • A high-performance computer system that analyzes multiple data streams from many sources live. • The word stream in stream computing is used to mean pulling in streams of data, processing the data and streaming it back out as a single flow. • Stream computing uses software algorithms that analyzes the data in real time as it streams in to (which)increase speed and accuracy when dealing with data handling and analysis
  • 14. • In a stream processing system, applications typically act as continuous queries, ingesting data continuously, analyzing and correlating the data, and generating a stream of results. • Applications are represented as data-flow graphs composed of operators and interconnected by streams. • The individual operators implement algorithms for data analysis, such as parsing, filtering, feature extraction, and classification. • Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensors readings from sites in a forest, etc.).
  • 16. • IBM announced its stream computing system, called System S. • ATI Technologies also announced a stream computing technology that describes its technology that enables the graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems.
  • 17. PIG
  • 18. PIG • Pig raises the level of abstraction for processing large datasets. • With Pig, the data structures are much richer, typically being multivalued and nested; and the set of transformations you can apply to the data are much more powerful • Pig Latin, a Parallel Data Flow Language. Pig Latin is a data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. Pig is made up of two pieces: • The language used to express data flows, called Pig Latin. • The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
  • 19.
  • 20. Apache Pig Components There are several components in the Apache Pig framework. Parser • At first, all the Pig Scripts are handled by the Parser. Parser basically checks the syntax of the script, does type checking, and other miscellaneous checks. Afterwards, Parser’s output will be a DAG (directed acyclic graph) that represents the Pig Latin statements as well as logical operators. The logical operators of the script are represented as the nodes and the data flows are represented as edges in DAG (the logical plan) Optimizer • Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries out the logical optimizations further such as projection and push down.
  • 21. Compiler • Then compiler compiles the optimized logical plan into a series of MapReduce jobs. Execution engine • Eventually, all the MapReduce jobs are submitted to Hadoop in a sorted order. • Ultimately, it produces the desired results
  • 22. Important points • A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. • Pig transforms your query into series of mapreduce task and you unaware of this. • You will focus on the data and you dont know nature of execution. • Pig is a scripting language for exploring large datasets. • Pig’s sweet spot is its ability to process terabytes of data simply by using only half-dozen lines of Pig Latin from the console.
  • 23. • Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by userdefined functions (UDFs). • As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs. • Pig isn’t suitable for all data processing tasks. • If you want to perform a query that touches only a small amount of data in a large dataset, then Pig will not perform well, since it is set up to scan the whole dataset, or at least large portions of it.
  • 24. Execution Types Pig has two execution types or modes: • local mode and • MapReduce mode. Local mode: • In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and when trying out Pig. • The execution type is set using the -x or -exectype option. To run in local mode, set the option to local. • % pig -x local • grunt> • This starts Grunt, the Pig interactive shell.
  • 25. MapReduce mode • In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. • The cluster may be a pseudo or fully distributed cluster. • To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. Pig releases will only work against particular versions of Hadoop.
  • 26. Running Pig Programs There are three ways of executing Pig programs, all of which work in both local and MapReduce mode: Script: Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line. Grunt: • Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec. Embedded: • You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
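The embedded mode above can be sketched as follows. This is a minimal, hedged example of driving Pig from Java through the PigServer class; the input path, field names, and output directory are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; ExecType.MAPREDUCE would target a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register Pig Latin statements one at a time (hypothetical paths/fields).
        pig.registerQuery("records = LOAD 'input/sample.txt' "
                + "AS (year:chararray, temperature:int);");
        pig.registerQuery("filtered = FILTER records BY temperature IS NOT NULL;");

        // Storing the relation triggers the actual execution.
        pig.store("filtered", "output/filtered");
    }
}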
  • 27. • Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator). • Batch Mode (Script) − You can run Apache Pig in batch mode by writing a Pig Latin script in a single file with the .pig extension. • Embedded Mode (UDF) − Apache Pig provides the ability to define our own functions (User Defined Functions) in programming languages such as Java, and to use them in our scripts.
  • 28. • We can run Pig scripts from the shell after invoking the Grunt shell. Moreover, the Grunt shell offers certain useful shell and utility commands.
  • 30. PigLatin • Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs. • A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command. For example, a GROUP operation is a type of statement: grouped_records = GROUP records BY year; Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it (in Grunt, omitting the semicolon does not raise an error immediately, because the shell simply waits for the rest of the statement).
  • 31. • To analyze data in Hadoop using Apache Pig, we use the Pig Latin language. • Basically, an interpreter layer first transforms the Pig Latin statements into MapReduce jobs, and Hadoop then processes these jobs. • Pig Latin is a very simple language with SQL-like semantics. • It can be used in a very productive manner.
  • 32. • It also contains a rich set of built-in functions for data manipulation. • Moreover, we can easily extend it by writing user-defined functions (UDFs) in Java. • This means Pig Latin is extensible in nature.
  • 33. Data Model in Pig Latin • The data model of Pig is fully nested. The outermost structure of the Pig Latin data model is a relation, which is a bag, where: • A bag is a collection of tuples. • A tuple is an ordered set of fields. • A field is a piece of data.
  • 34. Statements in Pig Latin • Statements are the basic constructs for processing data with Pig Latin. • Statements work with relations and include expressions and schemas. • Every statement ends with a semicolon (;). • Through statements we perform operations using the operators offered by Pig Latin. • Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output. • Semantic checking is carried out as soon as we enter a LOAD statement in the Grunt shell, but to see the loaded contents we need to use the DUMP operator: the MapReduce job that actually reads the data from the file system runs only when the DUMP (or a STORE) is performed.
  • 35. Pig Latin Datatypes • int • “Int” represents a signed 32-bit integer. For Example: 10 • long • It represents a signed 64-bit integer. For Example: 10L • float • This data type represents a signed 32-bit floating point. For Example: 10.5F • double • “double” represents a 64-bit floating point. For Example: 10.5 • chararray • It represents a character array (string) in Unicode UTF-8 format. For Example: ‘Data Flair’ • Bytearray • This data type represents a Byte array (blob).
  • 36. • Boolean • “Boolean” represents a Boolean value. For Example : true/ false. Note: It is case insensitive. • Datetime • It represents a date-time. For Example : 1970-01-01T00:00:00.000+00:00 • Biginteger • This data type represents a Java BigInteger. For Example: 60708090709 • Bigdecimal • “Bigdecimal” represents a Java BigDecimal For Example: 185.98376256272893883
  • 37. Complex Types • Tuple • Bag • Map. Pig Latin Operators • Arithmetic Operators • Comparison Operators • Type Construction Operators
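As a hedged illustration of how the complex types look in Pig Latin (the alias, file, and field names here are hypothetical):

-- Declaring complex types in a LOAD schema (illustrative only):
A = LOAD 'data.txt'
    AS (t: tuple(a:int, b:int),          -- a tuple of two fields
        bg: bag{ tp: tuple(x:int) },     -- a bag of one-field tuples
        mp: map[]);                      -- a map with untyped values
-- Constant notation for the same types:
--   tuple:  (3, 8, 9)
--   bag:    {(3, 8), (1, 2)}
--   map:    ['name'#'Alice', 'age'#25]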
  • 38. Data Processing Operators Loading and Storing • LOAD – It loads data from the file system into a relation. • STORE – It stores a relation to the file system (local/HDFS). Filtering • FILTER – It removes unwanted rows from a relation. • DISTINCT – It removes duplicate rows from a relation. • FOREACH…GENERATE – It transforms the data based on columns of data. • STREAM – It transforms a relation using an external program.
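A minimal, hedged Pig Latin sketch tying these operators together; the input path, field names, and quality codes are hypothetical:

-- Load tab-delimited weather records (illustrative schema).
records  = LOAD 'input/sample.txt'
           AS (year:chararray, temperature:int, quality:int);
-- Keep only valid readings.
good     = FILTER records BY temperature != 9999 AND
           (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
-- Group by year and compute the maximum temperature per group.
grouped  = GROUP good BY year;
max_temp = FOREACH grouped GENERATE group, MAX(good.temperature);
-- Write the result back to the file system.
STORE max_temp INTO 'output/max_temp';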
  • 39. • Diagnostic Operators • DUMP – It prints the contents of a relation to the console. • DESCRIBE – It describes the schema of a relation. • EXPLAIN – It displays the logical and physical execution plans used to evaluate a relation. • ILLUSTRATE – It displays the step-by-step execution of a series of statements on a small sample of the data.
  • 40. Grouping and Joining • JOIN – We can join two or more relations. • COGROUP – It groups the data from two or more relations. • GROUP – It groups the data in a single relation. • CROSS – We can create the cross product of two or more relations. Sorting • ORDER – It arranges a relation in order based on one or more fields. • LIMIT – We can get a particular number of tuples from a relation. Combining and Splitting • UNION – We can combine two or more relations into one relation. • SPLIT – It splits a single relation into two or more relations.
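A hedged sketch of JOIN, ORDER, and LIMIT; the relations and fields are hypothetical:

users  = LOAD 'users.txt'  AS (user_id:int, name:chararray);
orders = LOAD 'orders.txt' AS (order_id:int, user_id:int, amount:double);
-- Inner join on user_id.
joined = JOIN orders BY user_id, users BY user_id;
-- Sort by the order amount (disambiguated as orders::amount) and keep the top 10.
sorted = ORDER joined BY orders::amount DESC;
top10  = LIMIT sorted 10;
DUMP top10;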
  • 41. Hive • Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files. • It processes structured and semi-structured data in Hadoop. • Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. • Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS. • Metadata such as table schemas is stored in a database called the metastore.
  • 43. Metastore • It stores metadata for each and every table. • Hive also includes the partition metadata. • This helps the driver to track the progress of various data sets distributed over the cluster. • It stores the data in a traditional RDBMS format. • A backup server regularly replicates the metastore data so that it can be retrieved in case of data loss.
  • 44. Driver • It acts like a controller which receives the HiveQL statements. • The driver starts the execution of the statement by creating sessions. • It monitors the life cycle and progress of the execution. • Driver stores the necessary metadata generated during the execution of a HiveQL statement. • It also acts as a collection point of data or query result obtained after the Reduce operation
  • 45. Compiler • It performs the compilation of the HiveQL query. • This converts the query to an execution plan. The plan contains the tasks. • It also contains the steps needed by MapReduce to produce the output, as translated from the query. • The compiler in Hive converts the query to an Abstract Syntax Tree (AST). • It first checks for compatibility and compile-time errors, then converts the AST to a Directed Acyclic Graph (DAG).
  • 46. • Optimizer – It performs various transformations on the execution plan to provide optimized DAG. It aggregates the transformations together, such as converting a pipeline of joins to a single join, for better performance. The optimizer can also split the tasks, such as applying a transformation on data before a reduce operation, to provide better performance. • Executor – Once compilation and optimization complete, the executor executes the tasks. Executor takes care of pipelining the tasks.
  • 47. • CLI, UI, and Thrift Server – • CLI (command-line interface) provides a user interface for an external user to interact with Hive. • Thrift server in Hive allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
  • 48. Hive Shell • The shell is the primary way that we interact with Hive, by issuing commands in HiveQL. • HiveQL is Hive’s query language, a dialect of SQL that is heavily influenced by MySQL. • When starting Hive for the first time, we can check that it is working by listing its tables. The command must be terminated with a semicolon to tell Hive to execute it: hive> SHOW TABLES; OK Time taken: 10.425 seconds • Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here. The Tab key will autocomplete Hive keywords and functions. • For a fresh install, the command takes a few seconds to run since it is lazily creating the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to where you ran the hive command from.)
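A hedged sketch of a first session in the shell; the table name, columns, and input path are hypothetical:

hive> CREATE TABLE records (year STRING, temperature INT, quality INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH 'input/sample.txt' OVERWRITE INTO TABLE records;
hive> SELECT year, MAX(temperature) FROM records GROUP BY year;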
  • 50. • Hive Client • Hive Services • Processing and Resource Management • Distributed Storage
  • 51. Hive Client • Hive supports applications written in languages such as Python, Java, C++, and Ruby, which use JDBC, ODBC, and Thrift drivers to run queries against Hive. Hence, one can easily write a Hive client application in the language of their choice. • Hive clients are categorized into three types: 1. Thrift Clients • The Hive server is based on Apache Thrift, so it can serve requests from a Thrift client. 2. JDBC client • Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive server. 3. ODBC client • The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
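A minimal, hedged sketch of the JDBC client mentioned above, assuming the Apache Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath and a HiveServer2 instance is reachable; the host, port, database, and credentials are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));   // print each table name
            }
        }
    }
}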
  • 52. Hive Services • To run queries, Hive provides various services such as the CLI, HiveServer2, Beeline, etc. The main services offered by Hive are: cli • The command-line interface to Hive (the shell). This is the default service. Hive server • HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive. It allows multiple clients to submit requests to Hive and retrieve the final results, and it is designed to provide the best support for open API clients such as JDBC and ODBC. hwi • The Hive Web Interface. jar • The Hive equivalent of hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
  • 53. Metastore • The metastore is a central repository that stores metadata about the structure of tables and partitions, including column and column type information. • It also stores information about the serializer and deserializer (SerDe) required for read/write operations, and the HDFS files where the data is stored. The metastore is generally a relational database. • The metastore provides a Thrift interface for querying and manipulating Hive metadata. We can configure the metastore in either of two modes: • Remote: In remote mode, the metastore is a Thrift service, which is useful for non-Java applications. • Embedded: In embedded mode, the client can interact directly with the metastore using JDBC.
  • 54. Embedded • The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration. • Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error: Failed to start database 'metastore_db' when it attempts to open a connection to the metastore.
  • 55. • The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service, but connects to a database running in a separate process, either on the same machine or on a remote machine. • There’s another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes to the Hive service. This brings better manageability and security, since the database tier can be completely firewalled off, and the clients no longer need the database credentials.
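A hedged sketch of how such a standalone (local or remote) metastore is typically configured in hive-site.xml; the database host, credentials, and metastore host below are hypothetical.

<!-- Local metastore: the Hive service connects to a standalone MySQL database. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
<!-- Remote metastore: clients instead point at the separate metastore service. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>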
  • 59. HiveQL • The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data whose schemas are kept in the metastore. SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING having_condition] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT number]; Creating a database: CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
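A hedged example that exercises the syntax above; the database, table, and column names are hypothetical:

CREATE DATABASE IF NOT EXISTS weather;
USE weather;
SELECT year, MAX(temperature) AS max_temp
FROM   records
WHERE  quality IN (0, 1, 4, 5, 9)
GROUP BY year
LIMIT 10;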
  • 63. Operators • Relational Operators • Arithmetic Operators • Logical Operators • String Operators • Operators on Complex Types
  • 64. Hive DDL commands • Hive DDL commands are the statements used for defining and changing the structure of a table or database in Hive. It is used to build or modify the tables and other objects in the database. • The several types of Hive DDL commands are: • CREATE • SHOW • DESCRIBE • USE • DROP • ALTER • TRUNCATE
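Hedged sketches of these DDL commands; the database, table, and column names are hypothetical:

CREATE DATABASE IF NOT EXISTS retail;
USE retail;
CREATE TABLE customers (id INT, name STRING, city STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE customers;
ALTER TABLE customers RENAME TO clients;
TRUNCATE TABLE clients;
DROP TABLE IF EXISTS clients;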
  • 65. Hive DML Commands • Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and delete data from the Hive table once the table and database schema has been defined using Hive DDL commands. • The various Hive DML commands are: • LOAD • SELECT • INSERT • DELETE • UPDATE • EXPORT • IMPORT
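Hedged sketches of these DML commands; the paths and table names are hypothetical, and note that UPDATE and DELETE only work on transactional (ACID, ORC-backed) tables:

LOAD DATA LOCAL INPATH '/tmp/customers.csv' INTO TABLE customers;
INSERT INTO TABLE customers VALUES (101, 'Asha', 'Chennai');
SELECT * FROM customers WHERE city = 'Chennai';
UPDATE customers SET city = 'Mumbai' WHERE id = 101;   -- requires an ACID table
DELETE FROM customers WHERE id = 101;                  -- requires an ACID table
EXPORT TABLE customers TO '/tmp/customers_export';
IMPORT TABLE customers_copy FROM '/tmp/customers_export';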
  • 66. Joins • Inner join in Hive • Left Outer Join in Hive • Right Outer Join in Hive • Full Outer Join in Hive
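Hedged sketches of the four join types over two hypothetical tables:

SELECT c.name, o.amount
  FROM customers c JOIN orders o ON c.id = o.customer_id;             -- inner join
SELECT c.name, o.amount
  FROM customers c LEFT OUTER JOIN orders o ON c.id = o.customer_id;  -- left outer
SELECT c.name, o.amount
  FROM customers c RIGHT OUTER JOIN orders o ON c.id = o.customer_id; -- right outer
SELECT c.name, o.amount
  FROM customers c FULL OUTER JOIN orders o ON c.id = o.customer_id;  -- full outer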
  • 67. Partition • Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, and department. • Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data. • There are two types of partitioning in Apache Hive: static partitioning and dynamic partitioning (see the sketch below). • For decomposing table data sets into even more manageable parts, Apache Hive uses the concept of bucketing. However, there is much more to learn about bucketing in Hive.
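A hedged sketch contrasting static and dynamic partitioning; the table names, columns, and staging table are hypothetical:

CREATE TABLE sales (id INT, amount DOUBLE)
  PARTITIONED BY (country STRING, year INT);

-- Static partitioning: partition values are given explicitly at load time.
LOAD DATA LOCAL INPATH '/tmp/sales_in_2023.csv'
  INTO TABLE sales PARTITION (country = 'IN', year = 2023);

-- Dynamic partitioning: partition values come from the query result itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (country, year)
SELECT id, amount, country, year FROM staging_sales;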
  • 68. HBase Basics • HBase is a distributed, column-oriented database built on top of HDFS. • HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets. Why HBase: • RDBMSs get exponentially slower as the data becomes large. • They expect data to be highly structured, i.e. to fit a well-defined schema. • Any change in schema might require downtime.
  • 70. HBase concepts There are three types of servers in a master-slave HBase architecture: the HBase HMaster, Region Servers, and ZooKeeper. • Region servers serve data for read and write purposes; that means clients can communicate with HBase Region Servers directly while accessing data. • The HBase Master process handles region assignment as well as DDL (create, delete table) operations. • Finally, ZooKeeper maintains the live cluster state.
  • 71. HMaster Server • The master server: • Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task. • Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers. • Maintains the state of the cluster by negotiating the load balancing. • Is responsible for schema changes and other metadata operations such as the creation of tables and column families.
  • 72. Regions • Regions are nothing but tables that are split up and spread across the region servers. Region server • The region servers host regions and: • Communicate with the client and handle data-related operations. • Handle read and write requests for all the regions under them. • Decide the size of the region by following the region size thresholds.
  • 73. Zookeeper • ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, and providing distributed synchronization. • ZooKeeper has nodes representing the different region servers. Master servers use these nodes to discover available servers. • In addition to availability, the nodes are also used to track server failures or network partitions. • Clients use ZooKeeper to locate region servers before communicating with them. • In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
  • 74. Regions • Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table’s rows. • A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive). • Initially, a table comprises a single region, but as the size increases, after it crosses a configurable size threshold, it splits at a row boundary into two new regions of approximately equal size. • Until this first split happens, all loading will be against the single server hosting the original region.
  • 75. • As the table grows, the number of its regions grows. Regions are the units that get distributed over an HBase cluster. • In this way, a table that is too big for any one server can be carried by a cluster of servers, with each node hosting a subset of the table’s total regions. • Load on the table thus gets distributed. • The online set of sorted regions comprises the table’s total content.
  • 77. • To maintain server state in the HBase cluster, HBase uses ZooKeeper as a distributed coordination service. • Basically, ZooKeeper maintains which servers are alive and available, and it provides server-failure notification. • Moreover, ZooKeeper is used to maintain a guaranteed common shared state.
  • 78. Clients There are a number of client options for interacting with an HBase cluster. • Java • HBase, like Hadoop, is written in Java, and the native client API is a Java API. • MapReduce • HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs. • The TableInputFormat class makes splits on region boundaries. • The HBase TableOutputFormat writes the result of the reduce phase into HBase.
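A minimal, hedged sketch of the Java client API (org.apache.hadoop.hbase.client); the table name, column family, qualifier, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same cell.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}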
  • 79. • HBase also ships with Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java. • In all cases, a Java server hosts an instance of the HBase client, brokering the application’s Avro, REST, and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
  • 80. HBase Vs RDBMS Database Type HBase • HBase is the column-oriented database. On defining Column-oriented, each column is a contiguous unit of page. RDBMS • Whereas, RDBMS is row-oriented that means here each row is a contiguous unit of page. Schema-type • Schema of HBase is less restrictive, adding columns on the fly is possible. RDBMS Schema of RDBMS is more restrictive.
  • 81. Sparse Tables HBase • HBase is good with sparse tables. RDBMS • Whereas, RDBMS is not optimized for sparse tables. Scale up/Scale out HBase • HBase supports scale-out: when we need more memory, processing power, or disk, we add new servers to the cluster rather than upgrading the present ones. RDBMS • However, RDBMS supports scale-up: when we need more memory, processing power, or disk, we upgrade the same server to a more powerful one rather than adding new servers.
  • 82. Amount of data HBase • Here the amount of data depends not on a particular machine but on the number of machines. RDBMS • In an RDBMS, the amount of data depends on the configuration of the server. ACID support HBase • HBase has no built-in ACID support. RDBMS • An RDBMS has ACID support.
  • 83. Data type HBase • HBase supports both structured and unstructured data. RDBMS • RDBMS is suited for structured data. Transaction integrity HBase • In HBase, there is no transaction guarantee. RDBMS • Whereas, RDBMS mostly guarantees transaction integrity. JOINs HBase • HBase does not support JOINs; they have to be handled in the application (or via MapReduce/Hive). RDBMS • RDBMS supports JOINs.
  • 84. Referential integrity HBase • When it comes to referential integrity, HBase has no built-in support. RDBMS • An RDBMS supports referential integrity.
  • 85. Big SQL • IBM Big SQL is a high-performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data across the organization an easy and secure experience.
  • 86. • A Big SQL query can quickly access a variety of data sources including HDFS, RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database connection or single query for best-in-class analytic capabilities. • With Big SQL, your organization can derive significant value from your enterprise data.
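As an illustration only (the table names, schemas, and federation setup here are hypothetical, and the exact DDL is omitted), a single Big SQL connection could join a Hadoop-backed table with an RDBMS-backed one in one query:

SELECT c.name, SUM(o.amount) AS total
FROM   hdfs_sales.orders o          -- hypothetical Hadoop (HDFS) table
JOIN   crm.customers c              -- hypothetical relational table
       ON c.id = o.customer_id
GROUP BY c.name
ORDER BY total DESC
FETCH FIRST 10 ROWS ONLY;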
  • 87. • Big SQL provides tools to help you manage your system and your databases, and you can use popular analytic tools to visualize your data. • Big SQL includes several tools and interfaces that are largely comparable to tools and interfaces that exist with most relational database management systems.
  • 88. • Big SQL's robust engine executes complex queries on relational data and Hadoop data. • Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query execution. • Combining these with a massively parallel processing (MPP) engine helps distribute query execution across the nodes in a cluster.
  • 89. • The Big SQL architecture uses the latest relational database technology from IBM. • The database infrastructure provides a logical view of the data (by allowing storage and management of metadata) and a view of the query compilation, plus the optimization and runtime environment for optimal SQL processing.
  • 91. • Applications connect to a specific node based on user configuration. • SQL statements are routed through this node to the Big SQL management node, also called the coordinating node. • There can be one or many management nodes, but there is only one Big SQL management node. There, SQL statements are compiled and optimized to generate a parallel execution query plan.
  • 92. • Then a runtime engine distributes the tasks (the query) to worker processes on the compute nodes and manages the consumption and return of the result set. • A compute node can be a physical server or an operating system instance. • The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes. • The DataNode holds the data.
  • 93. • When a worker node receives a query, it dispatches special processes that know how to read and write HDFS data natively. • Big SQL uses native and open source Java readers (and writers) that are able to ingest different file formats. • The Big SQL engine pushes predicates down to these processes so that they can, in turn, apply projection and selection closer to the data. These processes also transform input data into an appropriate format for consumption inside Big SQL.
  • 94. • All of these components can be placed on one management node, or each part can be placed on a separate management node. • We can separate the Big SQL management node from the other Hadoop master nodes. • This arrangement allows the Big SQL management node to have enough resources to store intermediate data from the Big SQL data nodes.