New Big Data Architecture
• A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.
Big data solutions typically involve the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
New big data architecture components:
1. Data sources: All big data solutions start with one or more data sources. Examples include:
• Application data stores, such as relational
databases.
• Static files produced by applications, such as web
server log files.
• Real-time data sources, such as IoT devices.
2. Data storage:
• Data for batch processing operations is
typically stored in a distributed file store
that can hold high volumes of large files in
various formats.
• This kind of store is often called a data
lake.
• Options for implementing this storage
include Azure Data Lake Store or blob
containers in Azure Storage.
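For illustration, here is a minimal PySpark sketch of reading raw files from such a distributed store; the storage account, container, and folder names are hypothetical placeholders, and the cluster is assumed to already have access to the storage account configured.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read raw files from a data lake layer backed by Azure Storage.
# The account, container, and folder names are hypothetical placeholders.
spark = SparkSession.builder.appName("read-from-data-lake").getOrCreate()

# Blob containers are addressed with the wasbs:// scheme; Azure Data Lake
# Storage Gen2 uses abfss:// instead.
raw_logs = spark.read.json(
    "wasbs://raw@examplestorage.blob.core.windows.net/weblogs/2024/*.json"
)

raw_logs.printSchema()   # inspect the inferred schema
print(raw_logs.count())  # rough sanity check on volume
```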
3. Batch processing:
• Because the data sets are so large, a big data solution must often process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
• Usually these jobs involve reading source files,
processing them, and writing the output to new files.
• Options include running U-SQL jobs in Azure Data
Lake Analytics, using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster, or
using Java, Scala, or Python programs in an
HDInsight Spark cluster.
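The sketch below illustrates this pattern with a PySpark batch job of the kind that might run on an HDInsight Spark cluster: it reads source files, filters and aggregates them, and writes the output to new files. The paths and column names are assumptions made for the example, not taken from the slides.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of a batch job: read source files, filter and aggregate them,
# and write the prepared output to new files. Paths and column names
# (timestamp, status, url) are hypothetical.
spark = SparkSession.builder.appName("batch-prepare-weblogs").getOrCreate()

logs = spark.read.option("header", True).csv(
    "wasbs://raw@examplestorage.blob.core.windows.net/weblogs/"
)

daily_errors = (
    logs.filter(F.col("status").cast("int") >= 500)   # keep server errors only
        .withColumn("day", F.to_date("timestamp"))
        .groupBy("day", "url")
        .agg(F.count("*").alias("error_count"))       # aggregate per day and URL
)

# Write the prepared data back to the data lake for downstream analysis.
daily_errors.write.mode("overwrite").parquet(
    "wasbs://curated@examplestorage.blob.core.windows.net/daily_errors/"
)
```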
4. Real-time message ingestion:
• If the solution includes real-time sources, the
architecture must include a way to capture and store
real-time messages for stream processing.
• This might be a simple data store, where incoming
messages are dropped into a folder for processing.
• However, many solutions need a message
ingestion store to act as a buffer for messages, and
to support scale-out processing, reliable delivery,
and other message queuing semantics.
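As one example of this layer, the snippet below publishes messages to Azure Event Hubs (a common choice for a message ingestion store) using the azure-eventhub Python SDK; the connection string, hub name, and payload are placeholders, and retry or error handling is omitted.

```python
from azure.eventhub import EventHubProducerClient, EventData

# Minimal sketch: publish messages into a message ingestion store
# (Azure Event Hubs here) so downstream stream processors can read them
# reliably. Connection string, hub name, and payload are placeholders.
CONNECTION_STR = "<event-hubs-namespace-connection-string>"
EVENTHUB_NAME = "telemetry"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"deviceId": "sensor-01", "temperature": 21.7}'))
    producer.send_batch(batch)  # the hub buffers messages for scale-out consumers
```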
5. Stream processing:
• After capturing real-time messages, the solution
must process them by filtering, aggregating, and
otherwise preparing the data for analysis.
• The processed stream data is then written to an
output sink. Azure Stream Analytics provides a
managed stream processing service based on
perpetually running SQL queries that operate on
unbounded streams.
• You can also use open source Apache streaming
technologies like Spark Streaming in an HDInsight
cluster.
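The following is a minimal sketch using Spark Structured Streaming (the current streaming API in Spark, closely related to the Spark Streaming option mentioned above): it filters and aggregates messages dropped into an ingestion folder and writes the results to an output sink. The folder paths and message schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# Minimal sketch: filter and aggregate captured messages and write the
# processed stream to an output sink. Paths and schema are hypothetical.
spark = SparkSession.builder.appName("stream-prepare").getOrCreate()

schema = (StructType()
          .add("deviceId", StringType())
          .add("temperature", DoubleType())
          .add("eventTime", TimestampType()))

# Source: messages landing as JSON files in an ingestion folder.
events = spark.readStream.schema(schema).json("/ingest/telemetry/")

# Average temperature per device over one-minute windows.
averages = (events
            .withWatermark("eventTime", "5 minutes")
            .groupBy(F.window("eventTime", "1 minute"), "deviceId")
            .agg(F.avg("temperature").alias("avg_temp")))

# Sink: append the processed results to the data lake for analysis.
query = (averages.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/curated/telemetry_minute/")
         .option("checkpointLocation", "/checkpoints/telemetry_minute/")
         .start())
query.awaitTermination()
```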
6. Analytical data store:
• Many big data solutions prepare data for analysis
and then serve the processed data in a structured
format that can be queried using analytical tools.
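A minimal sketch of this step, assuming Spark with Hive support on the cluster: the processed batch output is registered as a table so analytical tools can query it with SQL. The database, table, and path names are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: register processed batch output as a table in the Hive
# metastore so analytical tools can query it. Names are placeholders.
spark = (SparkSession.builder
         .appName("serve-analytics")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

daily_errors = spark.read.parquet("/curated/daily_errors/")
daily_errors.write.mode("overwrite").saveAsTable("analytics.daily_errors")

# Downstream tools (or ad hoc queries) can now read the structured table.
spark.sql(
    "SELECT day, SUM(error_count) AS errors FROM analytics.daily_errors GROUP BY day"
).show()
```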
7. Analysis and reporting:
• The goal of most big data solutions is to provide
insights into the data through analysis and reporting.
• To empower users to analyze the data, the
architecture may include a data modeling layer, such
as a multidimensional OLAP cube or tabular data
model in Azure Analysis Services.
• Analysis and reporting can also take the form of
interactive data exploration by data scientists or data
analysts.
8. Orchestration:
• Most big data solutions consist of repeated data
processing operations, encapsulated in
workflows, that transform source data, move
data between multiple sources and sinks, load
the processed data into an analytical data store,
or push the results straight to a report or
dashboard.
• To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
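As a hedged illustration of triggering such a workflow programmatically, the sketch below starts a run of an existing Azure Data Factory pipeline with the azure-mgmt-datafactory Python SDK, following the pattern in Microsoft's quickstarts; the subscription, resource group, factory, and pipeline names are placeholders, and the pipeline itself is assumed to already encapsulate the transform/move/load steps described above.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Hedged sketch: start an Azure Data Factory pipeline run and check its status.
# All identifiers below are placeholders for an existing deployment.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="rg-bigdata",
    factory_name="adf-bigdata",
    pipeline_name="prepare-weblogs",
    parameters={},
)

status = adf_client.pipeline_runs.get("rg-bigdata", "adf-bigdata", run.run_id).status
print(status)  # e.g. InProgress, Succeeded, Failed
```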
Apache Hive
• The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that complement the core Hadoop modules.
• Sqoop: used to import and export data between HDFS and relational databases (RDBMS).
• Pig: a procedural language platform used to develop scripts for MapReduce operations.
• Hive: a platform used to develop SQL-like scripts for MapReduce operations.
What is Hive
• Hive is a data warehouse infrastructure
tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize
Big Data, and makes querying and
analyzing easy.
• Hive was initially developed by Facebook; the Apache Software Foundation later took it up and developed it further as open source under the name Apache Hive. It is now used by many different companies.
Hive is not
• A relational database
• A design for OnLine Transaction Processing
(OLTP)
• A language for real-time queries and row-level
updates
Features of Hive
• It stores schema in a database (the metastore) and processed data in HDFS.
• It is designed for OLAP.
• It provides a SQL-like query language called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
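To make the HiveQL point concrete, here is a minimal sketch that defines a table and runs a SQL-like query from Python via the PyHive client; the HiveServer2 host, table, and column names are hypothetical, and Hive compiles the SELECT into MapReduce (or Tez/Spark) jobs behind the scenes.

```python
from pyhive import hive  # PyHive client; assumes a reachable HiveServer2

# Minimal sketch of HiveQL: define a table over data in HDFS and run a
# SQL-like query against it. Host, table, and column names are hypothetical.
conn = hive.Connection(host="headnode0", port=10000, username="hadoop", database="default")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS weblogs (
        ts STRING,
        url STRING,
        status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM weblogs
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")

for url, hits in cur.fetchall():
    print(url, hits)
```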