Dive into the world of Apache Hive with this presentation, which covers an introduction to Hive, common misconceptions about it, its features and origins, the reasons behind Hive's existence, its architecture and working principles, its data model, its modes of operation, and its advantages and disadvantages. The presentation concludes with practical examples demonstrating how to create tables in Hive, upload data, and execute queries within the Hadoop environment. Join us on a journey through the intricacies of Hive, unraveling its capabilities and applications in big data analytics.
2. What is Hive?
• Apache Hive is data warehouse software built on top of
Hadoop that facilitates reading, writing, and managing
large datasets residing in distributed storage using SQL.
• Hive provides the necessary SQL abstraction so that SQL-
like queries can be run against the underlying data
without having to implement them in the low-level
Java MapReduce API.
• It allows structure to be projected onto data that is
already in storage.
3. • It can create schemas/table definitions that
point to data in Hadoop, turning unstructured
data into structured data.
• It lets you treat your data in Hadoop as tables,
which can be partitioned and bucketed.
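As a sketch of projecting structure onto data already in storage, an external table can point at existing files in HDFS; the column layout, delimiters, and HDFS path below are assumptions chosen to match the IMDb-style data used in the later query examples:

```sql
CREATE EXTERNAL TABLE movies (
  tconst          STRING,
  titletype       STRING,
  primarytitle    STRING,
  startyear       INT,
  runtimeminutes  INT,
  genres          ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
LOCATION '/user/hive/imdb/movies';  -- data files already sitting in HDFS
```

Dropping an external table removes only the table definition; the underlying files stay in HDFS.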
4. Hive is not
• A relational database
• A design for OnLine Transaction Processing
(OLTP)
• A language for real-time queries and row-level
updates
5. Features of Hive
• Hive is fast and scalable.
• It provides SQL-like queries (i.e., HQL) that are
implicitly transformed to MapReduce or Spark
jobs.
• It is capable of analyzing large datasets stored in
HDFS.
• It can operate on compressed data stored in the
Hadoop ecosystem.
• It supports user-defined functions (UDFs), so users
can plug in their own functionality.
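A sketch of wiring a UDF into a Hive session, assuming a Java class `com.example.udf.Upper` has already been compiled into `my_udfs.jar` (the jar path, class name, and function name are all hypothetical placeholders):

```sql
ADD JAR /tmp/my_udfs.jar;  -- make the compiled UDF jar visible to this session
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.Upper';
SELECT my_upper(primarytitle) FROM movies LIMIT 5;
```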
6. Hive Origination
• Hive originated as an internal project
at Facebook
• It was later adopted by Apache as an
open-source project
• Facebook deals with massive amounts
of data (petabyte scale) and needs
to perform more than 75k ad-hoc
queries on it
7. Why Hive?
• Since the data is collected from multiple
servers and is diverse in nature, no RDBMS
could serve as a workable solution
• MapReduce could be a natural choice, but it
has its own limitations
10. 1. Execute Query: A Hive interface such as the Command Line or Web UI sends
the query to the Driver (Hive's driver component, reachable through interfaces
such as JDBC or ODBC) for execution.
2. Get Plan: The driver asks the query compiler to parse the query, check its
syntax, and build the query plan.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any
database).
4. Send Metadata: The Metastore sends the metadata back to the compiler as a response.
5. Send Plan: The compiler checks the requirements and resends the plan to the
driver. At this point, the parsing and compiling of the query are complete.
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the job is a MapReduce job.
The execution engine sends the job to the JobTracker (on the name node), which
assigns it to TaskTrackers (on the data nodes). There, the query is executed
as a MapReduce job.
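The plan the compiler hands back in steps 2–5 can be inspected directly from the Hive shell; the query itself is just an illustration:

```sql
-- Show the stages Hive's compiler produces for a query,
-- without actually running it:
EXPLAIN
SELECT startyear, count(*) FROM movies GROUP BY startyear;
```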
11. Data modeling
• Tables
• Partitions
• Buckets
• Tables are organized into partitions, grouping the
same type of data based on a partition key
• Partitions are divided further into buckets based on
some other column
• Tables in Hive are created much the same way as
in an RDBMS
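A minimal sketch of such a table, assuming a partition key of startyear and bucketing on tconst (both choices, and the table name, are illustrative):

```sql
CREATE TABLE movies_part (
  tconst        STRING,
  primarytitle  STRING,
  genres        ARRAY<STRING>
)
PARTITIONED BY (startyear INT)          -- groups rows with the same year together
CLUSTERED BY (tconst) INTO 8 BUCKETS;   -- splits each partition further by tconst
```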
12. Different modes of Hive
• Hive can operate in two modes depending on
the number of data nodes in Hadoop.
• These modes are:
• Local mode
• MapReduce mode
13. Local Mode
• If Hadoop is installed in pseudo-distributed
mode with a single data node, we use Hive in
this mode
• If the data is small enough to fit on a single
local machine, we can use this mode
• Processing will be very fast on smaller data
sets present on the local machine
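As a sketch, local execution can be switched on from a Hive session; these property names apply to Hadoop 2.x-era Hive and may differ between versions:

```sql
-- Let Hive automatically run small jobs locally:
SET hive.exec.mode.local.auto=true;
-- Or force local execution for the whole session:
SET mapreduce.framework.name=local;
```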
14. MapReduce mode
• If Hadoop has multiple data nodes and data
is distributed across different nodes, we use
Hive in this mode
• It performs well on large data sets, and
queries execute in parallel
• Processing of large data sets with better
performance can be achieved in this
mode
15. Advantages of Hive
• Keeps queries running fast
• Takes far less time to write a Hive query than
the equivalent MapReduce code
• HiveQL is a declarative language like SQL
• Multiple users can simultaneously query the
data using Hive-QL.
• Very easy to write queries, including joins, in Hive
• Simple to learn and use
16. Disadvantages of Hive
• It is not designed for Online Transaction
Processing (OLTP); it is used only for
Online Analytical Processing (OLAP).
• Hive supports overwriting or appending
data, but not row-level updates and deletes.
• Subquery support in Hive is limited.
17. Copying a file from the local system into
the Hadoop environment
• hdfs dfs -copyFromLocal (file path)
(destination path)
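Alternatively, a local file can be loaded straight into a Hive table with HiveQL; the local path and table name below are placeholders:

```sql
-- Hypothetical local path and table name:
LOAD DATA LOCAL INPATH '/tmp/movies.tsv' INTO TABLE movies;
```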
25. Number of movies per year
select startyear,count(*) as count from
movies where startyear > 2000 and
startyear < 2022 group by startyear
order by count;
26. Comedy movies
• select primarytitle,startyear,runtimeminutes,genres from
movies where array_contains(genres,"Comedy");
28. Upcoming horror movies
select * from movies where titletype = 'movie'
and startyear > 2021 and
array_contains(genres,"Horror");
29. Movies in 2021 with rating more than 9
select m.startyear,m.titletype,m.primarytitle,r.averagerating,m.genres from movies as
m join rating as r on m.tconst = r.tconst
where m.titletype = 'movie' and m.startyear = 2021 and r.averagerating > 9;
30. Action series with rating more than 9
select m.startyear,m.titletype,m.primarytitle,r.averagerating,m.genres from movies as
m join rating as r on m.tconst = r.tconst
where m.titletype = 'tvSeries' and r.averagerating > 9 and
array_contains(genres,"Action");