DBMS Schemas for Decision Support, Star Schema, Snowflake Schema, Fact Constellation Schema, Schema Definition, Data Extraction, Cleanup and Transformation Tools
2. DBMS Schemas for Decision Support
A schema is a logical description of the entire database.
It includes the name and description of all record
types, including their associated data items and
aggregates.
Much like a database, a data warehouse also requires a
schema to be maintained.
A database uses the relational model, while a data warehouse
uses the Star, Snowflake, or Fact Constellation schema.
3. Star Schema
• Each dimension in a star schema is represented by
only one dimension table.
• This dimension table contains a set of attributes.
• The following diagram shows the sales data of a
company with respect to four dimensions,
namely time, item, branch, and location.
• There is a fact table at the center. It contains a key
to each of the four dimensions.
• The fact table also contains the measures, namely
dollars_sold and units_sold (see the SQL sketch below).
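A minimal SQL sketch of this star schema (all table and column names are illustrative assumptions based on the dimensions and measures named above, not definitions from the slide):

CREATE TABLE time_dim     (time_key INT PRIMARY KEY, day DATE, month INT, quarter INT, year INT);
CREATE TABLE item_dim     (item_key INT PRIMARY KEY, item_name VARCHAR(50), brand VARCHAR(50), type VARCHAR(50));
CREATE TABLE branch_dim   (branch_key INT PRIMARY KEY, branch_name VARCHAR(50), branch_type VARCHAR(50));
CREATE TABLE location_dim (location_key INT PRIMARY KEY, street VARCHAR(50), city VARCHAR(50), country VARCHAR(50));

-- The central fact table holds one foreign key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INT REFERENCES time_dim(time_key),
    item_key     INT REFERENCES item_dim(item_key),
    branch_key   INT REFERENCES branch_dim(branch_key),
    location_key INT REFERENCES location_dim(location_key),
    dollars_sold DECIMAL(12,2),
    units_sold   INT
);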
5. Snowflake Schema
• In the Snowflake schema, some dimension tables are normalized; unlike
the Star schema, this normalization splits the data into additional tables.
• For example, the item dimension table of the star schema is normalized
and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier_key.
• The supplier_key is linked to the supplier dimension table, which
contains the attributes supplier_key and supplier_type (see the
sketch below).
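A minimal SQL sketch of this normalization step (names are illustrative, continuing the star schema sketch above):

CREATE TABLE supplier_dim (
    supplier_key  INT PRIMARY KEY,
    supplier_type VARCHAR(50)
);

-- item_dim now carries a foreign key to supplier_dim instead of
-- repeating supplier attributes in every item row.
CREATE TABLE item_dim (
    item_key     INT PRIMARY KEY,
    item_name    VARCHAR(50),
    type         VARCHAR(50),
    brand        VARCHAR(50),
    supplier_key INT REFERENCES supplier_dim(supplier_key)
);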
7. Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as a galaxy
schema.
The following diagram shows two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
The shipping fact table also contains two measures, namely dollars_cost and
units_shipped.
It is also possible to share dimension tables between fact tables. For example,
the time, item, and location dimension tables are shared between the sales and
shipping fact tables (see the sketch below).
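A minimal SQL sketch of the second fact table (shipper_dim is an assumed dimension table; all names are illustrative):

-- shipping_fact reuses (shares) the time, item, and location dimension
-- tables already defined for the sales star.
CREATE TABLE shipping_fact (
    item_key      INT REFERENCES item_dim(item_key),
    time_key      INT REFERENCES time_dim(time_key),
    shipper_key   INT REFERENCES shipper_dim(shipper_key),
    from_location INT REFERENCES location_dim(location_key),
    to_location   INT REFERENCES location_dim(location_key),
    dollars_cost  DECIMAL(12,2),
    units_shipped INT
);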
9. Schema Definition
A multidimensional schema is defined using the Data Mining
Query Language (DMQL).
Its two primitives, cube definition and dimension
definition, can be used for defining data warehouses
and data marts, as the sketch below shows.
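A sketch of the two primitives for the sales star schema above, in the DMQL style of the Han and Kamber textbook example (the exact attribute lists are illustrative):

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)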
10. Data extraction, clean up and
transformation tools
1. Tool requirements:
The tools enable sourcing of the proper data contents and formats
from operational and external data stores into the data warehouse.
The tasks include:
Data transformation from one format to another
Data transformation and calculation based on the application of
business rules, e.g., deriving age from date of birth (see the sketch below)
Data consolidation (merging several source records into a single record) and
integration
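A minimal sketch of such a rule-based transformation, using MySQL-style date functions (table and column names are illustrative assumptions):

-- Derive age from date_of_birth while staging customer records.
SELECT customer_id,
       date_of_birth,
       TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE()) AS age
FROM   staging_customer;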
11. Data extraction, clean up and
transformation tools
Metadata synchronization and management include storing and updating
metadata definitions.
When implementing a data warehouse, several selection criteria that affect a
tool's ability to transform, integrate, and repair the data should be considered:
The ability to identify the data source
Support for flat files and indexed files
The ability to merge data from multiple data sources
The ability to read information from data dictionaries
The code generated by the tool should be maintainable in the development
environment
The ability to perform data type and character set translation, which is a
requirement when moving data between incompatible systems
12. Data extraction, clean up and
transformation tools
The ability to summarize and aggregate records
The data warehouse database management system should be able to perform
the load directly from the tool using its native API.
2. Vendor approaches:
The tasks of capturing data from a source system, cleaning and transforming
it, and loading the result into a target system can be carried out either by
separate products or by a single integrated solution. The integrated
solutions are described below:
Code generators:
Create tailored 3GL/4GL transformation programs based on source and target
data definitions.
The data transformation and enhancement rules are defined by the developer,
and the generated code employs a data manipulation language.
13. Data extraction, clean up and
transformation tools
Database data replication tools:
These capture changes to a single data source on one system and apply the
changes to a copy of the source data loaded on a different system.
Rule-driven dynamic transformation engines (also known as data mart
builders):
Capture data from a source system at user-defined intervals, transform the
data, and then send and load the results into a target system.
Data transformation and enhancement are based on a script or function logic
defined to the tool.
14. Data extraction, clean up and
transformation tools
3. Access to legacy data:
Today many businesses are adopting client/server technologies and data
warehousing to meet customer demand for new products and services and to
obtain competitive advantage.
The majority of the information required to support business applications and
the analytical power of data warehousing is located behind mainframe-based
legacy systems.
To reach this data while protecting their heavy financial investment in
existing hardware and software, many organizations turn to middleware
solutions.
The middleware strategy is the foundation for enterprise access; it is designed
for scalability and manageability in a data warehousing environment.
15. Data extraction, clean up and
transformation tools
4. Vendor solutions:
4.1 Prism Solutions:
Prism Warehouse Manager provides a solution for data warehousing by mapping
source data to a target database management system.
It generates code to extract and integrate data, create and manage metadata,
and build subject-oriented historical databases.
It extracts data from multiple sources: DB2, IMS, VSAM, RMS, and sequential
files.
16. Data extraction, clean up and
transformation tools
4.2 SAS Institute:
SAS data access engines serve as extraction tools that combine common
variables and transform data representation forms for consistency.
SAS also supports decision reporting and graphing, so it can act as the
front end as well.
4.3 Carleton Corporation's PASSPORT and MetaCenter:
Carleton's PASSPORT and the MetaCenter fulfill the data extraction and
transformation needs of data warehousing.
17. Metadata
1. Metadata defined
Metadata is data about data. It includes:
The location and description of the data warehouse components
The names, definitions, structure, and content of the data warehouse
Identification of data sources
The integration and transformation rules used to populate the data warehouse and deliver data to end users
Information delivery information
Data warehouse operational information
Security and authorization
18. Metadata
2. Metadata Interchange Initiative
The initiative was formed to develop standard specifications for a metadata
interchange format, allowing vendors to exchange common metadata and avoid
the difficulties of exchanging, sharing, and managing metadata.
The initial goals include:
Creating a vendor-independent, industry-defined and maintained standard access
mechanism and standard API
Enabling individual tools to satisfy their specific metadata access
requirements freely and easily within the context of an interchange model
Defining a clean, simple interchange implementation infrastructure
Creating processes and procedures for extending and updating the standard
19. Metadata
The Metadata Interchange Initiative has defined two distinct metamodels:
The application metamodel, which holds the metadata for a particular application
The metadata metamodel, the set of objects that the metadata interchange
standard can be used to describe
These models can be represented by one or more classes of tools (data extraction,
cleanup, replication).
Metadata interchange standard framework
Metadata itself may be stored in any type of storage facility or format, such as
relational tables, ASCII files, fixed formats, or customized formats. The metadata
interchange standard framework translates an access request into the interchange
standard syntax and format.
20. Metadata
The metadata interchange standard framework can be accomplished through the
following approaches:
Procedural approach
ASCII batch approach: an ASCII file containing the metadata standard schema
and access parameters is reloaded whenever a tool accesses metadata
through the API
Hybrid approach: follows a data-driven model by implementing a table-driven
API that supports fully qualified references for each metadata element
The components of the metadata interchange standard framework:
The standard metadata model, which refers to the ASCII file format used to
represent the metadata
21. Metadata
The standard access framework, which describes the minimum number of API
functions needed to communicate metadata
The tool profile, a file that describes what aspects of the
interchange standard metamodel a particular tool supports
The user configuration, a file describing the legal interchange
paths for metadata in the user's environment
22. Metadata
3. Metadata Repository
The metadata repository is implemented as part of the data warehouse framework
and provides the following benefits:
Enterprise-wide metadata management
Reduced or eliminated information redundancy and inconsistency
Simplified management and improved organizational control
Increased flexibility, control, and reliability of application development
The ability to utilize existing applications
Elimination of redundancy through the ability to share and reuse metadata
23. Metadata
4. Metadata Management
Collecting, maintaining, and distributing metadata is needed for a successful
data warehouse implementation, so metadata management tools need to be
carefully evaluated before any purchasing decision is made.
5. Implementation Examples
Implementation approaches have been adopted by vendors such as
Platinum Technology,
R&O,
Prism Solutions, and
Logic Works
24. Metadata
6. Metadata trends
The process of integrating internal and external data into the warehouse faces
a number of challenges:
Inconsistent data formats
Missing or invalid data
Different levels of aggregation
Semantic inconsistency
Different types of databases (text, audio, full-motion video, images, temporal
databases, etc.)
These issues put an additional burden on the collection and management
of common metadata definitions; this is addressed by the Metadata Coalition's
Metadata Interchange Specification.
25. Reporting, Query Tools and
Applications
Tool categories: There are five categories of decision support tools
Reporting
Managed query
Executive information systems
OLAP
Data mining
Reporting tools
Production reporting tools:
Used by companies to generate regular operational reports or to support
high-volume batch jobs, such as calculating and printing paychecks
Report writers (e.g., Crystal Reports, Actuate Reporting System):
Let users design and run reports without having to rely on the IS department
26. Reporting, Query Tools and
Applications
Managed query tools
Managed query tools shield end users from the complexities of SQL and database
structures by inserting a metalayer between the user and the database
Metalayer: software that provides subject-oriented views of a database and
supports point-and-click creation of SQL, as the sketch below suggests
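A minimal sketch of how a metalayer can be approximated with a subject-oriented view over the star schema sketched earlier (all view and column names are illustrative assumptions):

-- The view hides the joins of the underlying star schema.
CREATE VIEW branch_sales AS
SELECT b.branch_name, i.item_name, t.year, f.dollars_sold
FROM   sales_fact f
JOIN   branch_dim b ON b.branch_key = f.branch_key
JOIN   item_dim   i ON i.item_key   = f.item_key
JOIN   time_dim   t ON t.time_key   = f.time_key;

-- A point-and-click query generated against the view stays simple:
SELECT branch_name, SUM(dollars_sold) AS total_sales
FROM   branch_sales
GROUP BY branch_name;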
Executive information systems (EIS)
First deployed on mainframe systems
Predate report writers and managed query tools
Used to build customized, graphical decision support applications or briefing books
Provide a high-level view of the business and access to external sources, e.g.,
custom, on-line news feeds
EIS applications highlight exceptions to business activity or rules by using
color-coded graphics
27. Reporting, Query Tools and
Applications
OLAP tools
Provide an intuitive way to view corporate data
Provide navigation through hierarchies and dimensions with a single click
Aggregate data along common business subjects or dimensions
Users can drill down, across, or up levels (see the roll-up sketch below)
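A minimal sketch of a roll-up along the time hierarchy in SQL:1999-style syntax (names are illustrative; ROLLUP support varies by DBMS):

-- One statement produces the totals at the month, quarter, year, and
-- grand-total levels: the aggregates a user walks through when rolling
-- up or drilling down.
SELECT year, quarter, month, SUM(dollars_sold) AS total_sales
FROM   sales
GROUP BY ROLLUP (year, quarter, month);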
Data mining tools
Provide insights into corporate data that are not easily discerned with managed
query or OLAP tools
Use a variety of statistical and AI algorithms to analyze the correlation of
variables in the data
28. Data Warehousing - OLAP
OLAP stands for Online Analytical Processing.
It uses database tables (fact and dimension tables) to enable multidimensional
viewing, analysis, and querying of large amounts of data.
For example, OLAP technology could provide management with fast answers to complex
queries on their operational data, or enable them to analyze their company's
historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are
designed to ask complex queries of large multidimensional collections of
data. For this reason, OLAP is usually paired with data warehousing.
29. Data Warehousing - OLAP
Need
The key driver of OLAP is the multidimensional nature of the business
problem.
These problems are characterized by retrieving a very large number of
records, which can reach gigabytes and terabytes, and summarizing this data
into a form that can be used by business analysts.
One limitation of SQL is that it cannot easily represent these complex
problems: a single business question must be translated into several SQL
statements involving multiple joins, intermediate tables, sorting,
aggregation, and a huge amount of temporary memory to store the intermediate
tables, as the sketch below illustrates.
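A minimal sketch of this point (names are illustrative; the CUBE extension was added in SQL:1999 and support varies by DBMS): summarizing sales by every combination of item, branch, and year would otherwise take 2^3 = 8 separate GROUP BY queries UNIONed together.

-- One statement computes all eight groupings, including the grand total.
SELECT item, branch, year, SUM(dollars_sold) AS total_sales
FROM   sales
GROUP BY CUBE (item, branch, year);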
30. Data Warehousing - OLAP
An Online Analytical Processing (OLAP) server is based on the
multidimensional data model.
It allows managers and analysts to gain insight into the information through
fast, consistent, and interactive access to information.
It provides an intuitive way to view corporate data.
Types of OLAP servers:
There are four types of OLAP servers:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL servers
31. OLAP Vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on the Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on the Entity-Relationship Model.
6 | Contains historical data. | Contains current data.
32. OLAP Vs OLTP
Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
33. Multidimensional Data Model
The multidimensional data model is an integral part of On-Line Analytical
Processing, or OLAP.
The simplest way to understand the multidimensional data model is to view
the data as a cube. The table at the left contains detailed sales data by
product, market, and time. The cube on the right associates sales numbers
(units sold) with the dimensions product type, market, and time, with the
unit values organized as cells in an array.
This cube can be expanded to include another array, price, which can be
associated with all or only some dimensions. As the number of dimensions
increases, the number of cube cells increases exponentially: three dimensions
with ten members each already yield 10^3 = 1,000 cells.
34. ETL Process in Data Warehouse
ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse. The process of ETL can be broken down into the following three
stages:
Extract: The first stage in the ETL process is to extract data from various
sources such as transactional systems, spreadsheets, and flat files. This step
involves reading data from the source systems and storing it in a staging area.
Transform: In this stage, the extracted data is transformed into a format that is
suitable for loading into the data warehouse. This may involve cleaning and
validating the data, converting data types, combining data from multiple
sources, and creating new data fields.
35. ETL Process in Data Warehouse
Load: After the data is transformed, it is loaded into the data warehouse. This
step involves creating the physical data structures and loading the data into
the warehouse.
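A minimal end-to-end sketch of the three stages in SQL (all table and column names are illustrative assumptions):

-- Extract: copy raw rows from the source system into a staging table.
INSERT INTO stg_orders (order_id, order_date, branch_id, amount_txt, units)
SELECT order_id, order_date, branch_id, amount, units
FROM   src_orders;

-- Transform and Load: validate, convert types, look up surrogate keys,
-- and load the warehouse fact table.
INSERT INTO sales_fact (time_key, branch_key, dollars_sold, units_sold)
SELECT t.time_key,
       b.branch_key,
       CAST(o.amount_txt AS DECIMAL(12,2)),  -- data type conversion
       o.units
FROM   stg_orders o
JOIN   time_dim   t ON t.calendar_date = o.order_date
JOIN   branch_dim b ON b.branch_id     = o.branch_id
WHERE  o.amount_txt IS NOT NULL;              -- basic cleaning/validation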
The ETL process is an iterative process that is repeated as new data is added
to the warehouse. The process is important because it ensures that the data in
the data warehouse is accurate, complete, and up-to-date. It also helps to
ensure that the data is in the format required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available,
such as Informatica, Talend, DataStage, and others, that can automate and
simplify the ETL process.
36. ETL Process in Data Warehouse
ETL tools: The most commonly used ETL tools are Hevo, Sybase, Oracle
Warehouse Builder, CloverETL, and MarkLogic.
Data warehouses: The most commonly used data warehouses are Snowflake,
Redshift, and BigQuery, among others.
Overall, the ETL process is an essential process in data
warehousing that helps to ensure that the data in the data
warehouse is accurate, complete, and up-to-date.
37. ETL Process
ADVANTAGES and DISADVANTAGES
Advantages of ETL process in data warehousing:
Improved data quality: ETL process ensures that the data in the data
warehouse is accurate, complete, and up-to-date.
Better data integration: ETL process helps to integrate data from multiple
sources and systems, making it more accessible and usable.
Increased data security: ETL process can help to improve data security by
controlling access to the data warehouse and ensuring that only authorized
users can access the data.
Improved scalability: ETL process can help to improve scalability by
providing a way to manage and analyze large amounts of data.
Increased automation: ETL tools and technologies can automate and simplify
the ETL process, reducing the time and effort required to load and update data
in the warehouse.
38. ETL Process
ADVANTAGES and DISADVANTAGES
Disadvantages of ETL process in data warehousing:
High cost: ETL process can be expensive to implement and maintain,
especially for organizations with limited resources.
Complexity: ETL process can be complex and difficult to implement,
especially for organizations that lack the necessary expertise or resources.
Limited flexibility: ETL process can be limited in terms of flexibility, as it
may not be able to handle unstructured data or real-time data streams.
Limited scalability: ETL process can be limited in terms of scalability, as it
may not be able to handle very large amounts of data.
Data privacy concerns: ETL process can raise concerns about data privacy, as
large amounts of data are collected, stored, and analyzed.
39. 10 Best Data Warehouse Tools to Explore
in 2023
1. Hevo Data
2. Amazon Web Services Data Warehouse Tools
3. Google Data Warehouse Tools
4. Microsoft Azure Data Warehouse Tools
5. Oracle Autonomous Data Warehouse
6. Snowflake
7. IBM Data Warehouse Tools
8. Teradata Vantage
9. SAS Cloud
10. SAP Data Warehouse Cloud
40. IMPORTANT WEBSITE LINKS
1. AWS Redshift: Best for real-time and predictive analytics
2. Oracle Autonomous Data Warehouse: Best for autonomous management
capabilities
3. Azure Synapse Analytics: Best for intelligent workload management
4. IBM Db2 Warehouse: Best for fully managed cloud versions
5. Teradata Vantage: Best for enhanced analytics capabilities
6. SAP BW/4HANA: Best for advanced analytics and tailored applications
7. Google BigQuery: Best for built-in query acceleration and serverless
architecture
8. Snowflake for Data Warehouse: Best for separate computation and storage
41. IMPORTANT WEBSITE LINKS
9. Cloudera Data Platform: Best for faster scaling
10. Micro Focus Vertica: Best for improved query performance
11. MarkLogic: Best for complex data challenges
12. MongoDB: Best for sophisticated access management
13. Talend: Best for simplified data governance
14. Informatica: Best for intelligent data management
15. Arm Treasure Data: Best for connected customer experience