Hadoop ecosystem
• The Hadoop ecosystem contains components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper, and Apache Oozie.
HBase & HCatalog
Apache HBase:
• HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is scalable and distributed, and it provides real-time read/write access to data in HDFS.
HCatalog:
• HCatalog is a table and storage management layer for Hadoop. It lets different Hadoop ecosystem components, such as MapReduce, Hive, and Pig, easily read and write data on the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
• By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats.
Apache Mahout
• Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in HDFS, Mahout provides data science tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
• Clustering – takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
• Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations).
• Classification – learns from existing categorizations and then assigns unclassified items to the best category.
• Frequent pattern mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and identifies which items typically appear together.
Apache Sqoop & Flume
• Sqoop:
Imports data from external sources into Hadoop ecosystem components such as HDFS, HBase, or Hive.
It also exports data from Hadoop back to external sources.
Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
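• For illustration only, a minimal sketch of what such transfers typically look like on the command line; the host, database, table, and path names are hypothetical:

# Import a MySQL table into HDFS (hypothetical connection string and table)
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table orders --target-dir /user/etl/orders

# Export processed results from HDFS back to the relational database
sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P \
  --table order_summary --export-dir /user/etl/order_summary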
Apache Flume
Flume efficiently collects, aggregate and moves a
large amount of data from its origin and sending it
back to HDFS.
It is fault tolerant and reliable mechanism.
This Hadoop Ecosystem component allows the
data flow from the source into Hadoop
environment.
It uses a simple extensible data model that allows
for the online analytic application.
Using Flume, we can get the data from multiple
servers immediately into hadoop.
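• As an illustration, a minimal (hypothetical) Flume agent configuration that tails a log file and delivers the events to HDFS might look like the sketch below; the agent name, log path, and HDFS path are placeholders:

# one exec source, one memory channel, one HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1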
Query languages for Hadoop
• MapReduce (MR) is the standard Big Data processing model for parallel and distributed processing of large datasets.
• The low-level, batch-oriented nature of MR makes it difficult to program, which has given rise to abstraction layers on top of MR.
• Several high-level MapReduce query languages built on top of MR provide more abstract query languages and extend the MR programming model.
• These high-level MapReduce query languages remove the burden of MR programming from developers and ease the migration of existing SQL skills to Big Data.
• Common high-level MapReduce query languages are built directly on top of MR and translate queries into executable native MR jobs.
• Four such high-level MapReduce query languages (JAQL, Hive, Big SQL, and Pig) can be compared with regard to their expressiveness and ease of programming.
Query languages for Hadoop
• Pig, from Yahoo! and now incubating at
Apache, has an imperative language called Pig
Latin for performing operations on large data
files.
• Jaql, from IBM, is a declarative query language for JSON data.
• Hive, from Facebook, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
HIVE & PIG
Hive:
• The Hadoop ecosystem component Apache Hive is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
• Hive performs three main functions: data summarization, query, and analysis.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
Pig:
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS.
• Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language.
• It is very similar to SQL.
• It loads the data, applies the required filters, and translates the data into the required format.
STREAM COMPUTING
• Big data stream computing is able to analyze and process data in real time to gain immediate insight; it is typically applied to analyzing vast amounts of data in real time and processing them at high speed.
• A stream computing system is a high-performance computer system that analyzes multiple data streams from many sources live.
• The word stream in stream computing means pulling in streams of data, processing the data, and streaming it back out as a single flow.
• Stream computing uses software algorithms that analyze the data in real time as it streams in, which increases speed and accuracy when handling and analyzing data.
• In a stream processing system, applications typically
act as continuous queries, ingesting data continuously,
analyzing and correlating the data, and generating a
stream of results.
• Applications are represented as data-flow graphs
composed of operators and interconnected by streams.
• The individual operators implement algorithms for data
analysis, such as parsing, filtering, feature extraction,
and classification.
• Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensor readings from sites in a forest, etc.).
Stream computing
• IBM announced its stream computing system, called System S.
• ATI Technologies also announced a stream computing technology that enables graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems.
PIG
• Pig raises the level of abstraction for processing large datasets.
• With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
• Pig Latin is a parallel data flow language: it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel (a short example follows the list below).
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There
are currently two environments: local execution in a single
JVM and distributed execution on a Hadoop cluster.
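• For example, a small Pig Latin data flow that reads a file, filters and groups it, and stores the result in parallel; the file names and fields are made up for illustration:

-- hypothetical input/output paths and schema
records  = LOAD 'input/temps.txt' AS (year:chararray, temp:int, quality:int);
good     = FILTER records BY temp != 9999 AND quality == 0;
grouped  = GROUP good BY year;
max_temp = FOREACH grouped GENERATE group, MAX(good.temp);
STORE max_temp INTO 'output/max_temp';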
Apache Pig Components
There are several components in the Apache Pig framework.
Parser
• All Pig scripts are first handled by the Parser. The Parser checks the syntax of the script, does type checking, and performs other miscellaneous checks. Its output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
The logical operators of the script are represented as the nodes and the data flows as the edges of the DAG (the logical plan).
Optimizer
• The logical plan (DAG) is then passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
Compiler
• The compiler then compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
• Finally, the MapReduce jobs are submitted to Hadoop in sorted order.
• Ultimately, this produces the desired results.
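• You can inspect what these stages produce for a given script with the EXPLAIN operator in Grunt; a small sketch with a hypothetical relation:

grunt> records = LOAD 'input/temps.txt' AS (year:chararray, temp:int);
grunt> grouped = GROUP records BY year;
grunt> EXPLAIN grouped;   -- prints the logical, physical, and MapReduce plans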
Important points
• A Pig Latin program is made up of a series of
operations, or transformations, that are applied to the
input data to produce output.
• Pig transforms your query into a series of MapReduce tasks, and you are unaware of this.
• You focus on the data rather than on the nature of the execution.
• Pig is a scripting language for exploring large datasets.
• Pig's sweet spot is its ability to process terabytes of data using only a half-dozen lines of Pig Latin from the console.
• Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by user-defined functions (UDFs); a short registration sketch follows below.
• As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.
• Pig isn’t suitable for all data processing tasks.
• If you want to perform a query that touches only a
small amount of data in a large dataset, then Pig will
not perform well, since it is set up to scan the whole
dataset, or at least large portions of it.
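• Returning to the extensibility point above, a minimal sketch of how a hypothetical Java UDF would be wired into a script, assuming it has been packaged as myudfs.jar with a class com.example.pig.ToUpper:

REGISTER myudfs.jar;                            -- make the jar available to Pig
DEFINE ToUpper com.example.pig.ToUpper();       -- give the UDF a short alias
names   = LOAD 'input/users.txt' AS (name:chararray);
shouted = FOREACH names GENERATE ToUpper(name);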
Execution Types
Pig has two execution types or modes:
• local mode and
• MapReduce mode.
Local mode:
• In local mode, Pig runs in a single JVM and accesses the local
filesystem. This mode is suitable only for small datasets and when
trying out Pig.
• The execution type is set using the -x or -exectype option. To run in
local mode, set the option to local.
• % pig -x local
• grunt>
• This starts Grunt, the Pig interactive shell.
MapReduce mode
• In MapReduce mode, Pig translates queries into
MapReduce jobs and runs them on a Hadoop
cluster.
• The cluster may be a pseudo or fully distributed
cluster.
• To use MapReduce mode, you first need to check
that the version of Pig you downloaded is
compatible with the version of Hadoop you are
using. Pig releases will only work against
particular versions of Hadoop.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script:
Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig. Alternatively, for very short scripts,
you can use the -e option to run a script specified as a string on the command line.
Grunt:
• Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to run
Pig scripts from within Grunt using run and exec.
Embedded:
• You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
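• A minimal sketch of the embedded approach using the PigServer class; the paths and aliases are hypothetical and error handling is omitted:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE
    pig.registerQuery("records = LOAD 'input/temps.txt' AS (year:chararray, temp:int);");
    pig.registerQuery("grouped = GROUP records BY year;");
    pig.store("grouped", "output/grouped");          // runs the pipeline and stores the result
  }
}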
• Interactive Mode (Grunt shell) − You can run
Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump
operator).
• Batch Mode (Script) − You can run Apache Pig in
Batch mode by writing the Pig Latin script in a
single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig also allows us to define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
• We can run Pig scripts in the shell after invoking the Grunt shell. Moreover, the Grunt shell offers a number of useful shell and utility commands.
Pig Latin
• Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs.
• A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:
grouped_records = GROUP records BY year;
Statements are usually terminated with a semicolon, as in the GROUP statement above. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it (in Grunt, an unterminated statement is simply not executed until the terminating semicolon is entered).
• To analyze data in Hadoop using Apache Pig, we use the Pig Latin language.
• First, the Pig Latin statements are transformed into MapReduce jobs by an interpreter layer; Hadoop then processes these jobs.
• Pig Latin is a very simple language with SQL-like semantics.
• It can be used in a productive manner.
• It also contains a rich set of functions for data manipulation.
• Moreover, it can easily be extended by writing user-defined functions (UDFs) in Java, which makes it extensible in nature.
Data Model in Pig Latin
• The data model of Pig is fully nested. The outermost structure of the Pig Latin data model is a relation, which is a bag, where:
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
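• For illustration, this is how each construct looks when a relation is printed (the values are made up):
A tuple: (2019, 30)
A bag (a collection of tuples): {(2019,30),(2019,28)}
A field (a single piece of data): 30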
Statements in Pig Latin
• Statements are the basic constructs for processing data with Pig Latin.
• Statements work with relations and include expressions and schemas.
• Every statement ends with a semicolon (;).
• Through statements, we perform operations using the operators offered by Pig Latin.
• Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output.
• Semantic checking is carried out as soon as we enter a LOAD statement in the Grunt shell.
• To see the contents of the relation, we need to use the DUMP operator, because the MapReduce job that actually loads the data from the file system runs only when a DUMP (or STORE) operation is performed.
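• A small sketch of this in the Grunt shell; the HDFS path and schema are hypothetical:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
                 USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
grunt> DUMP student;      -- only now is a MapReduce job launched to read the data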
Pig Latin Data Types
• int – a signed 32-bit integer. Example: 10
• long – a signed 64-bit integer. Example: 10L
• float – a signed 32-bit floating point. Example: 10.5F
• double – a 64-bit floating point. Example: 10.5
• chararray – a character array (string) in Unicode UTF-8 format. Example: 'Data Flair'
• bytearray – a byte array (blob).
• boolean – a Boolean value. Example: true/false (case insensitive).
• datetime – a date-time. Example: 1970-01-01T00:00:00.000+00:00
• biginteger – a Java BigInteger. Example: 60708090709
• bigdecimal – a Java BigDecimal. Example: 185.98376256272893883
Complex Types
• Tuple
• Bag
• Map
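• A sketch of how the complex types are declared in a LOAD schema; the field names and input path are hypothetical:

data = LOAD 'input/complex.txt'
       AS (t: tuple(a:int, c:chararray),        -- a tuple of two fields
           bg: bag{row: tuple(x:int, y:int)},   -- a bag of tuples
           m: map[]);                           -- a map of key#value pairs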
Pig Latin Operators
Arithmetic Operators
Comparison Operators
Type Construction Operators
Data Processing Operators
Loading and Storing
• LOAD
It loads the data from a file system into a relation.
• STORE
It stores a relation to the file system (local/HDFS).
Filtering
• FILTER
Removes unwanted rows from a relation.
• DISTINCT
Removes duplicate rows from a relation.
• FOREACH, GENERATE
Transforms the data based on columns of data.
• STREAM
Transforms a relation using an external program.
Diagnostic Operators
• DUMP
Prints the contents of a relation on the console.
• DESCRIBE
Describes the schema of a relation.
• EXPLAIN
Displays the logical, physical, and MapReduce execution plans used to evaluate a relation.
• ILLUSTRATE
Displays the step-by-step execution of a series of statements.
Grouping and Joining
• JOIN – We can join two or more relations.
• COGROUP – Groups the data into two or more relations.
• GROUP – Groups the data in a single relation.
• CROSS – We can create the cross product of two or more relations.
Sorting
• ORDER
Arranges a relation in order based on one or more fields.
• LIMIT
We can get a particular number of tuples from a relation.
Combining and Splitting
• UNION
We can combine two or more relations into one relation.
• SPLIT
Splits a single relation into two or more relations.
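• A short sketch that chains several of these operators; the relations, fields, and paths are hypothetical:

orders  = LOAD 'input/orders.csv' USING PigStorage(',')
          AS (id:int, cust:chararray, amount:double);
big     = FILTER orders BY amount > 100.0;
by_cust = GROUP big BY cust;
totals  = FOREACH by_cust GENERATE group AS cust, SUM(big.amount) AS total;
sorted  = ORDER totals BY total DESC;
top10   = LIMIT sorted 10;
STORE top10 INTO 'output/top_customers';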
Hive
• Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files.
• It processes structured and semi-structured data in Hadoop.
• Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster.
• Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
• Metadata such as table schemas is stored in a database called the metastore.
Metastore
• It stores metadata for each and every table.
• Hive also includes the partition metadata.
• This helps the driver to track the progress of the various data sets distributed over the cluster.
• It stores the data in a traditional RDBMS format.
• A backup server regularly replicates the data, which can be retrieved in case of data loss.
Driver
• It acts like a controller which receives the HiveQL
statements.
• The driver starts the execution of the statement
by creating sessions.
• It monitors the life cycle and progress of the
execution.
• The driver stores the necessary metadata generated during the execution of a HiveQL statement.
• It also acts as a collection point for the data or query results obtained after the Reduce operation.
Compiler
• It performs the compilation of the HiveQL query.
• This converts the query to an execution plan. The
plan contains the tasks.
• The plan also contains the steps needed by MapReduce to produce the output, as translated from the query.
• The compiler in Hive converts the query to an Abstract Syntax Tree (AST).
• It first checks for compatibility and compile-time errors, then converts the AST to a Directed Acyclic Graph (DAG).
• Optimizer – It performs various
transformations on the execution plan to
provide optimized DAG. It aggregates the
transformations together, such as converting a
pipeline of joins to a single join, for better
performance. The optimizer can also split the
tasks, such as applying a transformation on
data before a reduce operation, to provide
better performance.
• Executor – Once compilation and optimization
complete, the executor executes the tasks.
Executor takes care of pipelining the tasks.
• CLI, UI, and Thrift Server –
• CLI (command-line interface) provides a user
interface for an external user to interact with
Hive.
• Thrift server in Hive allows external clients to
interact with Hive over a network, similar to
the JDBC or ODBC protocols.
Hive Shell
• The shell is the primary way that we will interact with Hive, by
issuing commands in HiveQL.
• HiveQL is Hive's query language, a dialect of SQL. It is heavily influenced by MySQL.
• When starting Hive for the first time, we can check that it is working by listing its tables. The command must be terminated with a semicolon to tell Hive to execute it.
• Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here. The Tab key will autocomplete Hive keywords and functions.
• For a fresh install, the command takes a few seconds to run, since it is lazily creating the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to where you ran the hive command from.)
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
• Hive Client
• Hive Services
• Processing and Resource Management
• Distributed Storage
Hive Client
• Hive supports applications written in languages like Python, Java, C++, and Ruby, using JDBC, ODBC, and Thrift drivers to perform queries against Hive. Hence, one can easily write a Hive client application in the language of one's choice.
• Hive clients are categorized into three types:
1. Thrift Clients
• The Hive server is based on Apache Thrift, so it can serve requests from Thrift clients.
2. JDBC client
• Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive server.
3. ODBC client
• The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
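• As an illustration of the JDBC route, a minimal sketch of a client, assuming HiveServer2 is reachable at a hypothetical host and port and the Hive JDBC driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");   // jdbc:hive2:// URL scheme
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hiveserver-host:10000/default", "user", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}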
Hive Services
• To perform queries, Hive provides various services such as HiveServer2, Beeline, etc. The main services offered by Hive are:
cli
• The command-line interface to Hive (the shell). This is the default service.
hiveserver2
• HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive. It allows multiple clients to submit requests to Hive and retrieve the final results. It is designed to provide the best support for open API clients like JDBC and ODBC.
hwi
• The Hive Web Interface.
jar
• The Hive equivalent to hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
Meta Store
• The metastore is a central repository that stores metadata about the structure of tables and partitions, including column and column type information.
• It also stores information about the serializers and deserializers required for read/write operations, and the HDFS files where the data is stored. This metastore is generally a relational database.
• The metastore provides a Thrift interface for querying and manipulating Hive metadata.
We can configure the metastore in either of two modes:
• Remote: In remote mode, metastore is a Thrift service and is useful
for non-Java applications.
• Embedded: In embedded mode, the client can directly interact with
the metastore using JDBC.
Embedded
• The metastore is the central repository of Hive metadata.
The metastore is divided into two pieces: a service and the
backing store for the data. By default, the metastore service
runs in the same JVM as the Hive service and contains an
embedded database instance backed by the local disk. This
is called the embedded metastore configuration
• Using an embedded metastore is a simple way to get
started with Hive; however, only one embedded Derby
database can access the database files on disk at any one
time, which means you can only have one Hive session
open at a time that shares the same metastore. Trying to
start a second session gives the error:
• Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.
• The solution to supporting multiple sessions (and
therefore multiple users) is to use a standalone
database. This configuration is referred to as a
local metastore, since the metastore service still
runs in the same process as the Hive service, but
connects to a database running in a separate
process, either on the same machine or on a
remote machine.
• There’s another metastore configuration called a
remote metastore, where one or more metastore
servers run in separate processes to the Hive
service. This brings better manageability and
security, since the database tier can be
completely firewalled off, and the clients no
longer need the database credentials.
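• As an illustration, the metastore mode is selected with a few hive-site.xml properties; a hedged sketch in which the host names and database URL are placeholders:

<!-- local metastore: Hive talks JDBC directly to a standalone database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore_db</value>
</property>

<!-- remote metastore: clients talk Thrift to one or more metastore servers -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>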
SQL vs HiveQL
HiveQL
• The Hive Query Language (HiveQL) is a query
language for Hive to process and analyze
structured data in a Metastore.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Creating a database:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
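For example, creating a database and running a query that follows the template above (all names are hypothetical):

CREATE DATABASE IF NOT EXISTS company;
USE company;
SELECT dept, COUNT(*) AS emp_count
FROM employees
WHERE salary > 50000
GROUP BY dept
LIMIT 10;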
Data Types in Hive
Complex Data Types
Operators
• Relational Operators
• Arithmetic Operators
• Logical Operators
• String Operators
• Operators on Complex Types
Hive DDL commands
• Hive DDL commands are the statements used for defining
and changing the structure of a table or database in Hive. It
is used to build or modify the tables and other objects in
the database.
• The several types of Hive DDL commands are:
• CREATE
• SHOW
• DESCRIBE
• USE
• DROP
• ALTER
• TRUNCATE
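• A short sketch of these DDL commands in use; all names are hypothetical:

CREATE DATABASE IF NOT EXISTS sales;
USE sales;
CREATE TABLE orders (id INT, cust STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE orders;
ALTER TABLE orders RENAME TO orders_2020;
TRUNCATE TABLE orders_2020;
DROP TABLE orders_2020;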
Hive DML Commands
• Hive DML (Data Manipulation Language) commands are
used to insert, update, retrieve, and delete data from the
Hive table once the table and database schema has been
defined using Hive DDL commands.
• The various Hive DML commands are:
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
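• And a sketch of the common DML commands; the paths and names are hypothetical, and UPDATE and DELETE additionally require the table to be transactional (ACID-enabled):

LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE orders;
INSERT INTO TABLE orders VALUES (101, 'alice', 250.0);
SELECT * FROM orders WHERE amount > 100;
-- UPDATE orders SET amount = 300.0 WHERE id = 101;   (ACID tables only)
-- DELETE FROM orders WHERE id = 101;                 (ACID tables only)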
Joins
• Inner join in Hive
• Left Outer Join in Hive
• Right Outer Join in Hive
• Full Outer Join in Hive
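• A sketch of the join syntax; the tables and columns are made up:

SELECT c.id, c.name, o.amount
FROM customers c
JOIN orders o ON (c.id = o.cust_id);            -- inner join

SELECT c.id, c.name, o.amount
FROM customers c
LEFT OUTER JOIN orders o ON (c.id = o.cust_id); -- keep all customers

(RIGHT OUTER JOIN and FULL OUTER JOIN follow the same pattern.)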
Partition
• Apache Hive organizes tables into partitions. Partitioning is
a way of dividing a table into related parts based on the
values of particular columns like date, city, and department.
• Each table in the hive can have one or more partition keys
to identify a particular partition. Using partition it is easy to
do queries on slices of the data.
• There are two types of partitioning in Apache Hive:
Static partitioning
Dynamic partitioning
• To decompose table data sets into even more manageable parts, Apache Hive also provides bucketing; there is much more to learn about bucketing in Hive.
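• A sketch of a partitioned (and bucketed) table with hypothetical names; depending on your configuration, dynamic partitioning may also require enabling hive.exec.dynamic.partition and the nonstrict mode shown below:

CREATE TABLE logs (msg STRING, level STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (level) INTO 4 BUCKETS;

-- static partitioning: the partition value is given explicitly
INSERT INTO TABLE logs PARTITION (dt='2020-01-01')
SELECT msg, level FROM staging_logs WHERE dt_col = '2020-01-01';

-- dynamic partitioning: the partition value comes from the data
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE logs PARTITION (dt)
SELECT msg, level, dt_col AS dt FROM staging_logs;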
HBase
HBasics
• HBase is a distributed, column-oriented database built on top of HDFS.
• HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
Why HBase:
• RDBMSs get exponentially slower as the data becomes large.
• They expect data to be highly structured, i.e. able to fit in a well-defined schema.
• Any change in schema might require downtime.
HBase concepts
There are three types of servers in the master-slave HBase architecture:
HBase HMaster
Region Server
ZooKeeper
• Region servers serve data for read and write purposes. That means clients can communicate directly with HBase region servers while accessing data.
• The HBase Master process handles region assignment as well as DDL (create, delete table) operations.
• Finally, ZooKeeper maintains the live cluster state.
HMaster Server
The master server:
• Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as the creation of tables and column families.
Regions
• Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers:
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of a region by following the region size thresholds.
ZooKeeper
• ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, and providing distributed synchronization.
• ZooKeeper has nodes representing the different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients locate region servers via ZooKeeper.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
Regions
• Tables are automatically partitioned horizontally by
HBase into regions. Each region comprises a subset
of a table’s rows.
• A region is denoted by the table it belongs to.
• Initially, a table comprises a single region, but as the
size increases, after it crosses a configurable size
threshold, it splits at a row boundary into two new
regions of approximately equal size.
• Until this first split happens, all loading will be
against the single server hosting the original region.
• As the table grows, the number of its regions grows. Regions are the units that get distributed over an HBase cluster.
• In this way, a table that is too big for any one
server can be carried by a cluster of servers
with each node hosting a subset of the table’s
total regions.
• Load on a table gets distributed.
• The online set of sorted regions comprises the
table’s total content.
• To maintain server state in the HBase Cluster,
HBase uses ZooKeeper as a distributed
coordination service.
• Basically, which servers are alive and available
is maintained by Zookeeper, and also it
provides server failure notification.
• Moreover, ZooKeeper maintains a guaranteed common shared state.
Clients
There are a number of client options for interacting with an HBase cluster.
• Java
• HBase, like Hadoop, is written in Java.
• MapReduce: HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs.
• The TableInputFormat class makes splits on region boundaries.
• The HBase TableOutputFormat writes the result of the reduce phase into HBase.
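• A minimal sketch of the Java client using the org.apache.hadoop.hbase.client API; the table, column family, and values are hypothetical and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // write one cell: row "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // read it back
      Result r = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}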
• HBase also ships with Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.
• In all cases, a Java server hosts an instance of the HBase client, brokering the application's Avro, REST, and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
HBase vs RDBMS
Database type
• HBase is a column-oriented database: each column is a contiguous unit of page.
• An RDBMS is row-oriented: each row is a contiguous unit of page.
Schema type
• The schema of HBase is less restrictive; adding columns on the fly is possible.
• The schema of an RDBMS is more restrictive.
Sparse tables
• HBase is good with sparse tables.
• An RDBMS is not optimized for sparse tables.
Scale up / scale out
• HBase supports scale-out: when we need more memory, processing power, or disk, we add new servers to the cluster rather than upgrading the present ones.
• An RDBMS supports scale-up: when we need more memory, processing power, or disk, we upgrade the same server to a more powerful one rather than adding new servers.
Amount of data
• In HBase, the amount of data depends not on a particular machine but on the number of machines.
• In an RDBMS, the amount of data depends on the configuration of the server.
ACID support
• HBase has no built-in ACID support.
• An RDBMS has ACID support.
Data type
• HBase supports both structured and unstructured data.
• An RDBMS is suited for structured data.
Transaction integrity
• In HBase, there is no transaction guarantee.
• An RDBMS mostly guarantees transaction integrity.
JOINs
• HBase does not support JOINs; they have to be implemented in the application layer (for example with MapReduce or Hive).
• An RDBMS supports JOINs.
Referential integrity
• When it comes to referential integrity, HBase has no built-in support.
• An RDBMS supports referential integrity.
Big SQL
• IBM Big SQL is a high-performance, massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data across the organization easy and secure.
• A Big SQL query can quickly access a variety of
data sources including HDFS, RDBMS, NoSQL
databases, object stores, and WebHDFS by
using a single database connection or single
query for best-in-class analytic capabilities.
• With Big SQL, your organization can derive
significant value from your enterprise data.
• Big SQL provides tools to help you manage
your system and your databases, and you can
use popular analytic tools to visualize your
data.
• Big SQL includes several tools and interfaces
that are largely comparable to tools and
interfaces that exist with most relational
database management systems.
• Big SQL's robust engine executes complex
queries for relational data and Hadoop data.
• Big SQL provides an advanced SQL compiler
and a cost-based optimizer for efficient query
execution.
• Combining these with a massive parallel
processing (MPP) engine helps distribute
query execution across nodes in a cluster.
• The Big SQL architecture uses the latest
relational database technology from IBM.
• The database infrastructure provides a logical
view of the data (by allowing storage and
management of metadata) and a view of the
query compilation, plus the optimization and
runtime environment for optimal SQL
processing.
• Applications connect to a specific node based on the user configuration.
• SQL statements are routed through this node to the Big SQL management node, the coordinating node.
• There can be one or many management nodes, but there is only one Big SQL management node. SQL statements are compiled and optimized to generate a parallel execution query plan.
• A runtime engine then distributes the tasks (queries) to worker processes on the compute nodes and manages the consumption and return of the result set.
• A compute node can be a physical server or an operating system instance.
• The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes.
• The DataNode holds the data.
• When a worker node receives a query, it dispatches special processes that know how to read and write HDFS data natively.
• Big SQL uses native and Java open-source readers (and writers) that are able to ingest different file formats.
• The Big SQL engine pushes predicates down to
these processes so that they can, in turn,
apply projection and selection closer to the
data. These processes also transform input
data into an appropriate format for
consumption inside Big SQL.
• All of these components can be placed on one management node, or each part can run on a separate management node.
• We can separate the Big SQL management
node from the other Hadoop master nodes.
• This arrangement can allow the Big SQL
management node to have enough resources
to store intermediate data from the Big SQL
data nodes.
More Related Content

Similar to BDA R20 21NM - Summary Big Data Analytics

A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsKrishnaVeni451953
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxiaeronlineexm
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptxVIJAYAPRABAP
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxraghavanand36
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction葵慶 李
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxinfinix8
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdataTom Rogers
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to HadoopAnandMHadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 

Similar to BDA R20 21NM - Summary Big Data Analytics (20)

A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analytics
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Hadoop
HadoopHadoop
Hadoop
 
An Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptxAn Introduction-to-Hive and its Applications and Implementations.pptx
An Introduction-to-Hive and its Applications and Implementations.pptx
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx01-Introduction-to-Hive.pptx
01-Introduction-to-Hive.pptx
 
hadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptxhadoop-ecosystem-ppt.pptx
hadoop-ecosystem-ppt.pptx
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
GETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptxGETTING YOUR DATA IN HADOOP.pptx
GETTING YOUR DATA IN HADOOP.pptx
 
Foxvalley bigdata
Foxvalley bigdataFoxvalley bigdata
Foxvalley bigdata
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Session 01 - Into to Hadoop
Session 01 - Into to HadoopSession 01 - Into to Hadoop
Session 01 - Into to Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 

Recently uploaded (20)

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 

BDA R20 21NM - Summary Big Data Analytics

  • 1. Hadoop ecosystem • Hadoop ecosystem contains components • like HDFS and HDFS components, • MapReduce, • YARN, Hive, Apache Pig, Apache HBase and HBase components, HCatalog, Avro, Thrift, Drill, Apache mahout, Sqoop, Apache Flume, Ambari, Zookeeper and Apache OOzie
  • 2.
  • 3. Hbase& Hcatalog Apache Hbase: • This is a Hadoop ecosystem component which is a distributed database that was designed to store structured data in tables that could have billions of row and millions of columns. HBase is scalable, distributed. HBase, provide real-time access to read or write data in HDFS. HCATALOG: • It is a table and storage management layer for Hadoop. HCatalog supports different components available in Hadoop ecosystems like MapReduce, Hive, and Pig to easily read and write data from the cluster. HCatalog is a key component of Hive that enables the user to store their data in any format and structure. • By default, HCatalog supports RCFile, CSV, JSON, sequenceFile and ORC file formats.
  • 4. Apache Mahout • Mahout is open source framework for creating scalable machine learning algorithm and data mining library. Once data is stored in Hadoop HDFS, mahout provides the data science tools to automatically find meaningful patterns in those big data sets. Algorithms of Mahout are: • Clustering – Here it takes the item in particular class and organizes them into naturally occurring groups, such that item belonging to the same group are similar to each other. • Collaborative filtering – It mines user behavior and makes product recommendations (e.g. Amazon recommendations) • Classifications – It learns from existing categorization and then assigns unclassified items to the best category. • Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or terms in query session) and then identifies which two items typically appear together.
  • 5. Apache Sqoop & Flume • Sqoop: Imports data from external sources into related Hadoop ecosystem components like HDFS, Hbase or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as teradata, Netezza, oracle, MySQL.
  • 6. Apache Flume Flume efficiently collects, aggregate and moves a large amount of data from its origin and sending it back to HDFS. It is fault tolerant and reliable mechanism. This Hadoop Ecosystem component allows the data flow from the source into Hadoop environment. It uses a simple extensible data model that allows for the online analytic application. Using Flume, we can get the data from multiple servers immediately into hadoop.
  • 7. Query languages for hadoop • MapReduce (MR) is a criterion of Big Data processing model with parallel and distributed large datasets. • This model knows difficult problems related to low-level and batch nature of MR that gives rise to an abstraction layer on the top of MR. • Several High-Level MapReduce Query Languages built on the top of MR provide more abstract query languages and extend the MR programming model
  • 8. • These High-Level MapReduce Query Languages remove the burden of MR programming away from the developers and make a soft migration of existing competences with SQL skills to Big Data. • Common High-Level MapReduce Query Languages built directly on the top of MR that translate queries into executable native MR jobs. • It evaluates the performance of the four presented High-Level MapReduce Query Languages: JAQL, Hive, Big SQL and Pig, with regards to their insightful perspectives and ease of programming.
  • 9.
  • 10. Query languages for hadoop • Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files. • Jaql, from IBM is a declarative query language for JSON data. • Hive, from Facebook is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
  • 11. HIVE & PIG Hive: • The Hadoop ecosystem component, Apache Hive, is an open source data warehouse system for querying and analysing large datasets stored in Hadoop files. • Hive do three main functions: data summarization, query, and analysis. • Hive use language called HiveQL (HQL), which is similar to SQL. • HiveQL automatically translates SQL-like queries into MapReduce jobs which will execute on Hadoop.
  • 12. Pig: • Apache Pig is a high-level language platform for analyzing and querying huge dataset that are stored in HDFS. • Pig as a component of Hadoop Ecosystem uses PigLatin language. • It is very similar to SQL. • It loads the data, applies the required filters and translate the data in the required format.
  • 13. STREAM COMPUTING • Big data stream computing is able to analyze and process data in real time to gain an immediate insight, and it is typically applied to the analysis of vast amount of data in real time and to process them at a high speed. • A high-performance computer system that analyzes multiple data streams from many sources live. • The word stream in stream computing is used to mean pulling in streams of data, processing the data and streaming it back out as a single flow. • Stream computing uses software algorithms that analyzes the data in real time as it streams in to (which)increase speed and accuracy when dealing with data handling and analysis
  • 14. • In a stream processing system, applications typically act as continuous queries, ingesting data continuously, analyzing and correlating the data, and generating a stream of results. • Applications are represented as data-flow graphs composed of operators and interconnected by streams. • The individual operators implement algorithms for data analysis, such as parsing, filtering, feature extraction, and classification. • Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensors readings from sites in a forest, etc.).
  • 16. • IBM announced its stream computing system, called System S. • ATI Technologies also announced a stream computing technology that describes its technology that enables the graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems.
  • 17. PIG
  • 18. PIG • Pig raises the level of abstraction for processing large datasets. • With Pig, the data structures are much richer, typically being multivalued and nested; and the set of transformations you can apply to the data are much more powerful • Pig Latin, a Parallel Data Flow Language. Pig Latin is a data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. Pig is made up of two pieces: • The language used to express data flows, called Pig Latin. • The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
  • 19.
  • 20. Apache Pig Components There are several components in the Apache Pig framework. Parser • At first, all the Pig Scripts are handled by the Parser. Parser basically checks the syntax of the script, does type checking, and other miscellaneous checks. Afterwards, Parser’s output will be a DAG (directed acyclic graph) that represents the Pig Latin statements as well as logical operators. The logical operators of the script are represented as the nodes and the data flows are represented as edges in DAG (the logical plan) Optimizer • Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries out the logical optimizations further such as projection and push down.
  • 21. Compiler • Then compiler compiles the optimized logical plan into a series of MapReduce jobs. Execution engine • Eventually, all the MapReduce jobs are submitted to Hadoop in a sorted order. • Ultimately, it produces the desired results
  • 22. Important points • A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. • Pig transforms your query into series of mapreduce task and you unaware of this. • You will focus on the data and you dont know nature of execution. • Pig is a scripting language for exploring large datasets. • Pig’s sweet spot is its ability to process terabytes of data simply by using only half-dozen lines of Pig Latin from the console.
  • 23. • Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by userdefined functions (UDFs). • As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs. • Pig isn’t suitable for all data processing tasks. • If you want to perform a query that touches only a small amount of data in a large dataset, then Pig will not perform well, since it is set up to scan the whole dataset, or at least large portions of it.
  • 24. Execution Types Pig has two execution types or modes: • local mode and • MapReduce mode. Local mode: • In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and when trying out Pig. • The execution type is set using the -x or -exectype option. To run in local mode, set the option to local. • % pig -x local • grunt> • This starts Grunt, the Pig interactive shell.
  • 25. MapReduce mode • In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. • The cluster may be a pseudo or fully distributed cluster. • To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. Pig releases will only work against particular versions of Hadoop.
  • 26. Running Pig Programs There are three ways of executing Pig programs, all of which work in both local and MapReduce mode: Script: Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line. Grunt: • Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec. Embedded: • You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.
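The embedded mode above can be sketched as follows. This is a minimal, hedged example of driving Pig from Java through the PigServer class; the input path, field names, and output directory are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; ExecType.MAPREDUCE would target a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register Pig Latin statements one at a time (hypothetical paths/fields).
        pig.registerQuery("records = LOAD 'input/sample.txt' "
                + "AS (year:chararray, temperature:int);");
        pig.registerQuery("filtered = FILTER records BY temperature IS NOT NULL;");

        // Storing the relation triggers the actual execution.
        pig.store("filtered", "output/filtered");
    }
}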
  • 27. • Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter Pig Latin statements and get the output (using the Dump operator). • Batch Mode (Script) − You can run Apache Pig in batch mode by writing a Pig Latin script in a single file with the .pig extension. • Embedded Mode (UDF) − Apache Pig provides the ability to define our own functions (User Defined Functions) in programming languages such as Java, and to use them in our scripts.
  • 28. • We can run Pig scripts from the shell after invoking the Grunt shell. Moreover, the Grunt shell offers certain useful shell and utility commands.
  • 30. PigLatin • Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs. • A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command. For example, a GROUP operation is a type of statement: grouped_records = GROUP records BY year; Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it (in Grunt, omitting the semicolon does not raise an error immediately, because the shell simply waits for the rest of the statement).
  • 31. • To analyze data in Hadoop using Apache Pig, we use the Pig Latin language. • Basically, an interpreter layer first transforms the Pig Latin statements into MapReduce jobs, and Hadoop then processes these jobs. • Pig Latin is a very simple language with SQL-like semantics. • It can be used in a very productive manner.
  • 32. • It also contains a rich set of built-in functions for data manipulation. • Moreover, we can easily extend it by writing user-defined functions (UDFs) in Java. • This means Pig Latin is extensible in nature.
  • 33. Data Model in Pig Latin • The data model of Pig is fully nested. The outermost structure of the Pig Latin data model is a relation, which is a bag, where: • A bag is a collection of tuples. • A tuple is an ordered set of fields. • A field is a piece of data.
  • 34. Statements in Pig Latin • Statements are the basic constructs for processing data with Pig Latin. • Statements work with relations and include expressions and schemas. • Every statement ends with a semicolon (;). • Through statements we perform operations using the operators offered by Pig Latin. • Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output. • Semantic checking is carried out as soon as we enter a LOAD statement in the Grunt shell, but to see the loaded contents we need to use the DUMP operator: the MapReduce job that actually reads the data from the file system runs only when the DUMP (or a STORE) is performed.
  • 35. Pig Latin Datatypes • int • “Int” represents a signed 32-bit integer. For Example: 10 • long • It represents a signed 64-bit integer. For Example: 10L • float • This data type represents a signed 32-bit floating point. For Example: 10.5F • double • “double” represents a 64-bit floating point. For Example: 10.5 • chararray • It represents a character array (string) in Unicode UTF-8 format. For Example: ‘Data Flair’ • Bytearray • This data type represents a Byte array (blob).
  • 36. • Boolean • “Boolean” represents a Boolean value. For Example : true/ false. Note: It is case insensitive. • Datetime • It represents a date-time. For Example : 1970-01-01T00:00:00.000+00:00 • Biginteger • This data type represents a Java BigInteger. For Example: 60708090709 • Bigdecimal • “Bigdecimal” represents a Java BigDecimal For Example: 185.98376256272893883
  • 37. Complex Types • Tuple • Bag • Map. Pig Latin Operators • Arithmetic Operators • Comparison Operators • Type Construction Operators
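As a hedged illustration of how the complex types look in Pig Latin (the alias, file, and field names here are hypothetical):

-- Declaring complex types in a LOAD schema (illustrative only):
A = LOAD 'data.txt'
    AS (t: tuple(a:int, b:int),          -- a tuple of two fields
        bg: bag{ tp: tuple(x:int) },     -- a bag of one-field tuples
        mp: map[]);                      -- a map with untyped values
-- Constant notation for the same types:
--   tuple:  (3, 8, 9)
--   bag:    {(3, 8), (1, 2)}
--   map:    ['name'#'Alice', 'age'#25]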
  • 38. Data Processing Operators Loading and Storing • LOAD – It loads data from the file system into a relation. • STORE – It stores a relation to the file system (local/HDFS). Filtering • FILTER – It removes unwanted rows from a relation. • DISTINCT – It removes duplicate rows from a relation. • FOREACH…GENERATE – It transforms the data based on columns of data. • STREAM – It transforms a relation using an external program.
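A minimal, hedged Pig Latin sketch tying these operators together; the input path, field names, and quality codes are hypothetical:

-- Load tab-delimited weather records (illustrative schema).
records  = LOAD 'input/sample.txt'
           AS (year:chararray, temperature:int, quality:int);
-- Keep only valid readings.
good     = FILTER records BY temperature != 9999 AND
           (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
-- Group by year and compute the maximum temperature per group.
grouped  = GROUP good BY year;
max_temp = FOREACH grouped GENERATE group, MAX(good.temperature);
-- Write the result back to the file system.
STORE max_temp INTO 'output/max_temp';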
  • 39. • Diagnostic Operators • DUMP – It prints the contents of a relation to the console. • DESCRIBE – It describes the schema of a relation. • EXPLAIN – It displays the logical and physical execution plans used to evaluate a relation. • ILLUSTRATE – It displays the step-by-step execution of a series of statements on a small sample of the data.
  • 40. Grouping and Joining • JOIN – We can join two or more relations. • COGROUP – It groups the data from two or more relations. • GROUP – It groups the data in a single relation. • CROSS – We can create the cross product of two or more relations. Sorting • ORDER – It arranges a relation in order based on one or more fields. • LIMIT – We can get a particular number of tuples from a relation. Combining and Splitting • UNION – We can combine two or more relations into one relation. • SPLIT – It splits a single relation into two or more relations.
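A hedged sketch of JOIN, ORDER, and LIMIT; the relations and fields are hypothetical:

users  = LOAD 'users.txt'  AS (user_id:int, name:chararray);
orders = LOAD 'orders.txt' AS (order_id:int, user_id:int, amount:double);
-- Inner join on user_id.
joined = JOIN orders BY user_id, users BY user_id;
-- Sort by the order amount (disambiguated as orders::amount) and keep the top 10.
sorted = ORDER joined BY orders::amount DESC;
top10  = LIMIT sorted 10;
DUMP top10;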
  • 41. Hive • Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files. • It processes structured and semi-structured data in Hadoop. • Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. • Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS. • Metadata such as table schemas is stored in a database called the metastore.
  • 43. Metastore • It stores metadata for each and every table. • Hive also includes the partition metadata. • This helps the driver to track the progress of various data sets distributed over the cluster. • It stores the data in a traditional RDBMS format. • A backup server regularly replicates the metastore data so that it can be retrieved in case of data loss.
  • 44. Driver • It acts like a controller which receives the HiveQL statements. • The driver starts the execution of the statement by creating sessions. • It monitors the life cycle and progress of the execution. • Driver stores the necessary metadata generated during the execution of a HiveQL statement. • It also acts as a collection point of data or query result obtained after the Reduce operation
  • 45. Compiler • It performs the compilation of the HiveQL query. • This converts the query to an execution plan. The plan contains the tasks. • It also contains the steps needed by MapReduce to produce the output, as translated from the query. • The compiler in Hive converts the query to an Abstract Syntax Tree (AST). • It first checks for compatibility and compile-time errors, then converts the AST to a Directed Acyclic Graph (DAG).
  • 46. • Optimizer – It performs various transformations on the execution plan to provide optimized DAG. It aggregates the transformations together, such as converting a pipeline of joins to a single join, for better performance. The optimizer can also split the tasks, such as applying a transformation on data before a reduce operation, to provide better performance. • Executor – Once compilation and optimization complete, the executor executes the tasks. Executor takes care of pipelining the tasks.
  • 47. • CLI, UI, and Thrift Server – • CLI (command-line interface) provides a user interface for an external user to interact with Hive. • Thrift server in Hive allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
  • 48. Hive Shell • The shell is the primary way that we interact with Hive, by issuing commands in HiveQL. • HiveQL is Hive’s query language, a dialect of SQL that is heavily influenced by MySQL. • When starting Hive for the first time, we can check that it is working by listing its tables. The command must be terminated with a semicolon to tell Hive to execute it: hive> SHOW TABLES; OK Time taken: 10.425 seconds • Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here. The Tab key will autocomplete Hive keywords and functions. • For a fresh install, the command takes a few seconds to run since it is lazily creating the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to where you ran the hive command from.)
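A hedged sketch of a first session in the shell; the table name, columns, and input path are hypothetical:

hive> CREATE TABLE records (year STRING, temperature INT, quality INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH 'input/sample.txt' OVERWRITE INTO TABLE records;
hive> SELECT year, MAX(temperature) FROM records GROUP BY year;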
  • 50. • Hive Client • Hive Services • Processing and Resource Management • Distributed Storage
  • 51. Hive Client • Hive supports applications written in languages such as Python, Java, C++, and Ruby, which use JDBC, ODBC, and Thrift drivers to run queries against Hive. Hence, one can easily write a Hive client application in the language of their choice. • Hive clients are categorized into three types: 1. Thrift Clients • The Hive server is based on Apache Thrift, so it can serve requests from a Thrift client. 2. JDBC client • Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive server. 3. ODBC client • The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
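A minimal, hedged sketch of the JDBC client mentioned above, assuming the Apache Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath and a HiveServer2 instance is reachable; the host, port, database, and credentials are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));   // print each table name
            }
        }
    }
}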
  • 52. Hive Services • To run queries, Hive provides various services such as the CLI, HiveServer2, Beeline, etc. The main services offered by Hive are: cli • The command-line interface to Hive (the shell). This is the default service. Hive server • HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive. It allows multiple clients to submit requests to Hive and retrieve the final results, and it is designed to provide the best support for open API clients such as JDBC and ODBC. hwi • The Hive Web Interface. jar • The Hive equivalent of hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
  • 53. Metastore • The metastore is a central repository that stores metadata about the structure of tables and partitions, including column and column type information. • It also stores information about the serializer and deserializer (SerDe) required for read/write operations, and the HDFS files where the data is stored. The metastore is generally a relational database. • The metastore provides a Thrift interface for querying and manipulating Hive metadata. We can configure the metastore in either of two modes: • Remote: In remote mode, the metastore is a Thrift service, which is useful for non-Java applications. • Embedded: In embedded mode, the client can interact directly with the metastore using JDBC.
  • 54. Embedded • The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration. • Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error: Failed to start database 'metastore_db' when it attempts to open a connection to the metastore.
  • 55. • The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service, but connects to a database running in a separate process, either on the same machine or on a remote machine. • There’s another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes to the Hive service. This brings better manageability and security, since the database tier can be completely firewalled off, and the clients no longer need the database credentials.
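A hedged sketch of how such a standalone (local or remote) metastore is typically configured in hive-site.xml; the database host, credentials, and metastore host below are hypothetical.

<!-- Local metastore: the Hive service connects to a standalone MySQL database. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
<!-- Remote metastore: clients instead point at the separate metastore service. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>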
  • 59. HiveQL • The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data whose schemas are kept in the metastore. SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING having_condition] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT number]; Creating a database: CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
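A hedged example that exercises the syntax above; the database, table, and column names are hypothetical:

CREATE DATABASE IF NOT EXISTS weather;
USE weather;
SELECT year, MAX(temperature) AS max_temp
FROM   records
WHERE  quality IN (0, 1, 4, 5, 9)
GROUP BY year
LIMIT 10;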
  • 63. Operators • Relational Operators • Arithmetic Operators • Logical Operators • String Operators • Operators on Complex Types
  • 64. Hive DDL commands • Hive DDL commands are the statements used for defining and changing the structure of a table or database in Hive. It is used to build or modify the tables and other objects in the database. • The several types of Hive DDL commands are: • CREATE • SHOW • DESCRIBE • USE • DROP • ALTER • TRUNCATE
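Hedged sketches of these DDL commands; the database, table, and column names are hypothetical:

CREATE DATABASE IF NOT EXISTS retail;
USE retail;
CREATE TABLE customers (id INT, name STRING, city STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE customers;
ALTER TABLE customers RENAME TO clients;
TRUNCATE TABLE clients;
DROP TABLE IF EXISTS clients;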
  • 65. Hive DML Commands • Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and delete data from the Hive table once the table and database schema has been defined using Hive DDL commands. • The various Hive DML commands are: • LOAD • SELECT • INSERT • DELETE • UPDATE • EXPORT • IMPORT
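Hedged sketches of these DML commands; the paths and table names are hypothetical, and note that UPDATE and DELETE only work on transactional (ACID, ORC-backed) tables:

LOAD DATA LOCAL INPATH '/tmp/customers.csv' INTO TABLE customers;
INSERT INTO TABLE customers VALUES (101, 'Asha', 'Chennai');
SELECT * FROM customers WHERE city = 'Chennai';
UPDATE customers SET city = 'Mumbai' WHERE id = 101;   -- requires an ACID table
DELETE FROM customers WHERE id = 101;                  -- requires an ACID table
EXPORT TABLE customers TO '/tmp/customers_export';
IMPORT TABLE customers_copy FROM '/tmp/customers_export';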
  • 66. Joins • Inner join in Hive • Left Outer Join in Hive • Right Outer Join in Hive • Full Outer Join in Hive
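Hedged sketches of the four join types over two hypothetical tables:

SELECT c.name, o.amount
  FROM customers c JOIN orders o ON c.id = o.customer_id;             -- inner join
SELECT c.name, o.amount
  FROM customers c LEFT OUTER JOIN orders o ON c.id = o.customer_id;  -- left outer
SELECT c.name, o.amount
  FROM customers c RIGHT OUTER JOIN orders o ON c.id = o.customer_id; -- right outer
SELECT c.name, o.amount
  FROM customers c FULL OUTER JOIN orders o ON c.id = o.customer_id;  -- full outer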
  • 67. Partition • Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, and department. • Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data. • There are two types of partitioning in Apache Hive: static partitioning and dynamic partitioning (see the sketch below). • For decomposing table data sets into even more manageable parts, Apache Hive uses the concept of bucketing. However, there is much more to learn about bucketing in Hive.
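A hedged sketch contrasting static and dynamic partitioning; the table names, columns, and staging table are hypothetical:

CREATE TABLE sales (id INT, amount DOUBLE)
  PARTITIONED BY (country STRING, year INT);

-- Static partitioning: partition values are given explicitly at load time.
LOAD DATA LOCAL INPATH '/tmp/sales_in_2023.csv'
  INTO TABLE sales PARTITION (country = 'IN', year = 2023);

-- Dynamic partitioning: partition values come from the query result itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (country, year)
SELECT id, amount, country, year FROM staging_sales;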
  • 68. HBase Basics • HBase is a distributed, column-oriented database built on top of HDFS. • HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets. Why HBase: • RDBMSs get exponentially slower as the data becomes large. • They expect data to be highly structured, i.e. to fit a well-defined schema. • Any change in schema might require downtime.
  • 70. HBase concepts There are three types of servers in a master-slave HBase architecture: the HBase HMaster, Region Servers, and ZooKeeper. • Region servers serve data for read and write purposes; that means clients can communicate with HBase Region Servers directly while accessing data. • The HBase Master process handles region assignment as well as DDL (create, delete table) operations. • Finally, ZooKeeper maintains the live cluster state.
  • 71. HMaster Server • The master server: • Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task. • Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers. • Maintains the state of the cluster by negotiating the load balancing. • Is responsible for schema changes and other metadata operations such as the creation of tables and column families.
  • 72. Regions • Regions are nothing but tables that are split up and spread across the region servers. Region server • The region servers host regions and: • Communicate with the client and handle data-related operations. • Handle read and write requests for all the regions under them. • Decide the size of the region by following the region size thresholds.
  • 73. Zookeeper • ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, and providing distributed synchronization. • ZooKeeper has nodes representing the different region servers. Master servers use these nodes to discover available servers. • In addition to availability, the nodes are also used to track server failures or network partitions. • Clients use ZooKeeper to locate region servers before communicating with them. • In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
  • 74. Regions • Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table’s rows. • A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive). • Initially, a table comprises a single region, but as the size increases, after it crosses a configurable size threshold, it splits at a row boundary into two new regions of approximately equal size. • Until this first split happens, all loading will be against the single server hosting the original region.
  • 75. • As the table grows, the number of its regions grows. Regions are the units that get distributed over an HBase cluster. • In this way, a table that is too big for any one server can be carried by a cluster of servers, with each node hosting a subset of the table’s total regions. • Load on the table thus gets distributed. • The online set of sorted regions comprises the table’s total content.
  • 77. • To maintain server state in the HBase cluster, HBase uses ZooKeeper as a distributed coordination service. • Basically, ZooKeeper maintains which servers are alive and available, and it provides server-failure notification. • Moreover, ZooKeeper is used to maintain a guaranteed common shared state.
  • 78. Clients There are a number of client options for interacting with an HBase cluster. • Java • HBase, like Hadoop, is written in Java, and the native client API is a Java API. • MapReduce • HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs. • The TableInputFormat class makes splits on region boundaries. • The HBase TableOutputFormat writes the result of the reduce phase into HBase.
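A minimal, hedged sketch of the Java client API (org.apache.hadoop.hbase.client); the table name, column family, qualifier, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same cell.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}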
  • 79. • HBase also ships with Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java. • In all cases, a Java server hosts an instance of the HBase client, brokering the application’s Avro, REST, and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
  • 80. HBase Vs RDBMS Database Type HBase • HBase is the column-oriented database. On defining Column-oriented, each column is a contiguous unit of page. RDBMS • Whereas, RDBMS is row-oriented that means here each row is a contiguous unit of page. Schema-type • Schema of HBase is less restrictive, adding columns on the fly is possible. RDBMS Schema of RDBMS is more restrictive.
  • 81. Sparse Tables HBase • HBase is good with sparse tables. RDBMS • Whereas, RDBMS is not optimized for sparse tables. Scale up/Scale out HBase • HBase supports scale-out: when we need more memory, processing power, or disk, we add new servers to the cluster rather than upgrading the present ones. RDBMS • However, RDBMS supports scale-up: when we need more memory, processing power, or disk, we upgrade the same server to a more powerful one rather than adding new servers.
  • 82. Amount of data HBase • Here the amount of data depends not on a particular machine but on the number of machines. RDBMS • In an RDBMS, the amount of data depends on the configuration of the server. ACID support HBase • HBase has no built-in ACID support. RDBMS • An RDBMS has ACID support.
  • 83. Data type HBase • HBase supports both structured and unstructured data. RDBMS • RDBMS is suited for structured data. Transaction integrity HBase • In HBase, there is no transaction guarantee. RDBMS • Whereas, RDBMS mostly guarantees transaction integrity. JOINs HBase • HBase does not support JOINs; they have to be handled in the application (or via MapReduce/Hive). RDBMS • RDBMS supports JOINs.
  • 84. Referential integrity HBase • When it comes to referential integrity, HBase has no built-in support. RDBMS • An RDBMS supports referential integrity.
  • 85. Big SQL • IBM Big SQL is a high-performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data across the organization an easy and secure experience.
  • 86. • A Big SQL query can quickly access a variety of data sources including HDFS, RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database connection or single query for best-in-class analytic capabilities. • With Big SQL, your organization can derive significant value from your enterprise data.
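As an illustration only (the table names, schemas, and federation setup here are hypothetical, and the exact DDL is omitted), a single Big SQL connection could join a Hadoop-backed table with an RDBMS-backed one in one query:

SELECT c.name, SUM(o.amount) AS total
FROM   hdfs_sales.orders o          -- hypothetical Hadoop (HDFS) table
JOIN   crm.customers c              -- hypothetical relational table
       ON c.id = o.customer_id
GROUP BY c.name
ORDER BY total DESC
FETCH FIRST 10 ROWS ONLY;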
  • 87. • Big SQL provides tools to help you manage your system and your databases, and you can use popular analytic tools to visualize your data. • Big SQL includes several tools and interfaces that are largely comparable to tools and interfaces that exist with most relational database management systems.
  • 88. • Big SQL's robust engine executes complex queries on relational data and Hadoop data. • Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query execution. • Combining these with a massively parallel processing (MPP) engine helps distribute query execution across the nodes in a cluster.
  • 89. • The Big SQL architecture uses the latest relational database technology from IBM. • The database infrastructure provides a logical view of the data (by allowing storage and management of metadata) and a view of the query compilation, plus the optimization and runtime environment for optimal SQL processing.
  • 91. • Applications connect to a specific node based on user configuration. • SQL statements are routed through this node to the Big SQL management node, also called the coordinating node. • There can be one or many management nodes, but there is only one Big SQL management node. There, SQL statements are compiled and optimized to generate a parallel execution query plan.
  • 92. • Then a runtime engine distributes the tasks (the query) to worker processes on the compute nodes and manages the consumption and return of the result set. • A compute node can be a physical server or an operating system instance. • The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes. • The DataNode holds the data.
  • 93. • When a worker node receives a query, it dispatches special processes that know how to read and write HDFS data natively. • Big SQL uses native and open source Java readers (and writers) that are able to ingest different file formats. • The Big SQL engine pushes predicates down to these processes so that they can, in turn, apply projection and selection closer to the data. These processes also transform input data into an appropriate format for consumption inside Big SQL.
  • 94. • All of these components can be placed on one management node, or each part can be placed on a separate management node. • We can separate the Big SQL management node from the other Hadoop master nodes. • This arrangement allows the Big SQL management node to have enough resources to store intermediate data from the Big SQL data nodes.