Dive into the world of Apache Hive with this presentation, which covers an introduction to Hive, common misconceptions about it, its features and origins, the reasons behind Hive's existence, its architecture and working principles, its data model, its modes of operation, and its advantages and disadvantages. The presentation concludes with practical examples demonstrating how to create tables in Hive, upload data, and execute queries within the Hadoop environment. Join us on a journey through the intricacies of Hive, unraveling its capabilities and applications in big data analytics.
2. What is Hive?
• Apache Hive is data warehouse software built on top of
Hadoop that facilitates reading, writing, and managing
large datasets residing in distributed storage using SQL.
• Hive provides the necessary SQL abstraction so that SQL-
like queries can be run against the underlying data
without having to implement them in the low-level
Java MapReduce API.
• It allows structure to be projected onto data that is
already in storage.
3. • It can create schemas/table definitions that
point to data in Hadoop, turning unstructured
data into structured data.
• It lets you treat your data in Hadoop as tables,
which can be partitioned and bucketed.
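As a sketch of projecting structure onto data already in storage, an external table can point at existing files in HDFS; the column layout, delimiters, and HDFS path below are assumptions chosen to match the IMDb-style data used in the later query examples:

```sql
CREATE EXTERNAL TABLE movies (
  tconst          STRING,
  titletype       STRING,
  primarytitle    STRING,
  startyear       INT,
  runtimeminutes  INT,
  genres          ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
LOCATION '/user/hive/imdb/movies';  -- data files already sitting in HDFS
```

Dropping an external table removes only the table definition; the underlying files stay in HDFS.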
4. Hive is not
• A relational database
• A design for OnLine Transaction Processing
(OLTP)
• A language for real-time queries and row-level
updates
5. Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are
implicitly transformed to MapReduce or Spark
jobs.
• It is capable of analyzing large datasets stored in
HDFS.
• It can operate on compressed data stored in the
Hadoop ecosystem.
• It supports user-defined functions (UDFs), so users
can plug in their own functionality.
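A sketch of wiring a UDF into a Hive session, assuming a Java class `com.example.udf.Upper` has already been compiled into `my_udfs.jar` (the jar path, class name, and function name are all hypothetical placeholders):

```sql
ADD JAR /tmp/my_udfs.jar;  -- make the compiled UDF jar visible to this session
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.Upper';
SELECT my_upper(primarytitle) FROM movies LIMIT 5;
```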
6. Hive Origination
• Hive originated as an internal project
at Facebook
• It was later adopted by Apache as an
open-source project
• Facebook deals with massive amounts
of data (petabyte scale) and needs
to perform more than 75k ad-hoc
queries on it
7. Why Hive?
• Since the data is collected from multiple
servers and is diverse in nature, no RDBMS
could serve as a workable solution
• MapReduce could be a natural choice, but it
has its own limitations
10. 1. Execute Query: A Hive interface such as the Command Line or Web UI sends
the query to the Driver (Hive's driver component, reachable through interfaces
such as JDBC or ODBC) for execution.
2. Get Plan: The driver asks the query compiler to parse the query, check its
syntax, and build the query plan.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any
database).
4. Send Metadata: The Metastore sends the metadata back to the compiler as a response.
5. Send Plan: The compiler checks the requirements and resends the plan to the
driver. At this point, the parsing and compiling of the query are complete.
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the job is a MapReduce job.
The execution engine sends the job to the JobTracker (on the name node), which
assigns it to TaskTrackers (on the data nodes). There, the query is executed
as a MapReduce job.
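The plan the compiler hands back in steps 2–5 can be inspected directly from the Hive shell; the query itself is just an illustration:

```sql
-- Show the stages Hive's compiler produces for a query,
-- without actually running it:
EXPLAIN
SELECT startyear, count(*) FROM movies GROUP BY startyear;
```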
11. Data modeling
• Tables
• Partitions
• Buckets
• Tables are organized into partitions, grouping the
same type of data based on a partition key
• Partitions are divided further into buckets based on
some other column
• Tables in Hive are created much the same way as
in an RDBMS
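A minimal sketch of such a table, assuming a partition key of startyear and bucketing on tconst (both choices, and the table name, are illustrative):

```sql
CREATE TABLE movies_part (
  tconst        STRING,
  primarytitle  STRING,
  genres        ARRAY<STRING>
)
PARTITIONED BY (startyear INT)          -- groups rows with the same year together
CLUSTERED BY (tconst) INTO 8 BUCKETS;   -- splits each partition further by tconst
```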
12. Different modes of Hive
• Hive can operate in two modes depending on
the number of data nodes in Hadoop.
• These modes are:
• Local mode
• MapReduce mode
13. Local Mode
• If Hadoop is installed in pseudo-distributed
mode with a single data node, we use Hive in
this mode
• If the data is small enough to fit on a single
local machine, we can use this mode
• Processing will be very fast on smaller data
sets present on the local machine
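As a sketch, local execution can be switched on from a Hive session; these property names apply to Hadoop 2.x-era Hive and may differ between versions:

```sql
-- Let Hive automatically run small jobs locally:
SET hive.exec.mode.local.auto=true;
-- Or force local execution for the whole session:
SET mapreduce.framework.name=local;
```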
14. MapReduce mode
• If Hadoop has multiple data nodes and data
is distributed across different nodes, we use
Hive in this mode
• It performs well on large data sets, and
queries execute in parallel
• Processing of large data sets with better
performance can be achieved in this
mode
15. Advantages of Hive
• Keeps queries running fast
• Takes far less time to write a Hive query than
the equivalent MapReduce code
• HiveQL is a declarative language like SQL
• Multiple users can simultaneously query the
data using Hive-QL.
• Very easy to write queries, including joins, in Hive
• Simple to learn and use
16. Disadvantages of Hive
• It is not designed for Online Transaction
Processing (OLTP); it is used only for
Online Analytical Processing (OLAP).
• Hive supports overwriting or appending
data, but not row-level updates and deletes.
• Subquery support in Hive is limited.
17. Copying a file from the local system into
the Hadoop environment
• hdfs dfs -copyFromLocal (file path)
(destination path)
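Alternatively, a local file can be loaded straight into a Hive table with HiveQL; the local path and table name below are placeholders:

```sql
-- Hypothetical local path and table name:
LOAD DATA LOCAL INPATH '/tmp/movies.tsv' INTO TABLE movies;
```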
25. Number of movies per year
select startyear,count(*) as count from
movies where startyear > 2000 and
startyear < 2022 group by startyear
order by count;
26. Comedy movies
• select primarytitle,startyear,runtimeminutes,genres from
movies where array_contains(genres,"Comedy");
28. Upcoming horror movies
select * from movies where titletype = 'movie'
and startyear > 2021 and
array_contains(genres,"Horror");
29. Movies in 2021 with rating more than 9
select m.startyear,m.titletype,m.primarytitle,r.averagerating,m.genres from movies as
m join rating as r on m.tconst = r.tconst
where m.titletype = 'movie' and m.startyear = 2021 and r.averagerating > 9;
30. Action series with rating more than 9
select m.startyear,m.titletype,m.primarytitle,r.averagerating,m.genres from movies as
m join rating as r on m.tconst = r.tconst
where m.titletype = 'tvSeries' and r.averagerating > 9 and
array_contains(genres,"Action");