1. DATA WAREHOUSING AND
DATA MINING
UNIT – 4
Prepared by
Mr. P. Nandakumar
Assistant Professor,
Department of IT, SVCET
2. UNIT-IV
Cluster Analysis: Types of Data in Cluster Analysis – A Categorization of Major Clustering Methods – Partitioning Methods – Hierarchical Methods – Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High-Dimensional Data – Constraint-Based Cluster Analysis – Outlier Analysis.
3. Cluster Analysis: Types of Data in
Cluster Analysis
• Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together.
• The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points
within each group are more similar to each other than to data points in other groups.
• This process is often used for exploratory data analysis and can help identify patterns or relationships
within the data that may not be immediately obvious.
• There are many different algorithms used for cluster analysis, such as k-means, hierarchical
clustering, and density-based clustering.
• The choice of algorithm will depend on the specific requirements of the analysis and the nature of the
data being analyzed.
4. Cluster Analysis: Types of Data in
Cluster Analysis
• Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine-learning technique that acts on unlabelled data: similar data points come together to form a cluster, and all the objects in a cluster belong to the same group.
• The given data is divided into groups by combining similar objects; such a group is called a cluster. A cluster is simply a collection of similar data points grouped together.
• For example, consider a dataset of vehicles containing information about cars, buses, bicycles, etc. Since this is unsupervised learning, there are no class labels such as "car" or "bike" for the vehicles; all the data is combined and is not structured in advance.
5. Properties of Clustering
Clustering Scalability
High Dimensionality
Algorithm Usability with multiple data kinds
Dealing with unstructured data
Interpretability
6. Types of Cluster Analysis
Broadly, on the basis of how data points are assigned to clusters, cluster analysis can be divided into two types: hard clustering and soft clustering. They are as follows.
1. Hard Clustering
• In hard clustering, each data point in a given dataset is placed in exactly one of the given clusters.
• This implies a strict classification of the data: every object belongs to one cluster and no other.
• For instance, a clustering algorithm places data points in the cluster with which they have the maximum similarity; those points share no membership with any other cluster.
7. Types of Cluster Analysis
2. Soft Clustering
• The second class of cluster analysis is soft clustering. Unlike hard clustering, which requires a given data point to belong to only one cluster at a time, soft clustering follows a different rule.
• In soft clustering, a given data point can belong to more than one cluster at a time, with a degree of membership in each. This means that a fuzzy classification of the data characterizes soft clustering.
• Fuzzy Clustering Algorithm in Machine learning is a renowned unsupervised algorithm for
processing data into soft clusters.
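The fuzzy idea can be illustrated with a minimal sketch of fuzzy c-means on 1-D data. The deterministic initialization, fixed iteration count, and sample values are simplifications for the example, not part of the standard algorithm:

```python
def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Minimal 1-D fuzzy c-means: every point receives a degree of
    membership in every cluster, so clusters can overlap (soft clustering)."""
    # Illustrative deterministic initialization: spread centres over the range.
    lo, hi = min(points), max(points)
    centres = [lo + (hi - lo) * (j + 1) / (c + 1) for j in range(c)]
    for _ in range(iters):
        # Membership update: the closer a centre, the higher the membership.
        u = [[1.0 / sum(((abs(x - centres[j]) + 1e-12) /
                         (abs(x - centres[k]) + 1e-12)) ** (2 / (m - 1))
                        for k in range(c))
              for j in range(c)]
             for x in points]
        # Centre update: membership-weighted mean of all points.
        centres = [sum(u[i][j] ** m * points[i] for i in range(len(points))) /
                   sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centres, u

centres, u = fuzzy_c_means([1.0, 1.5, 2.0, 10.0, 10.5, 11.0])
# Each row of u sums to 1: every point is shared, unequally, by both clusters.
```

Note how each point's memberships across the clusters sum to one, which is exactly the fuzzy classification described above.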
8. Applications of Cluster Analysis in
Real Time
Data Science
Marketing
Social Network Analysis
Image Segmentation
Collaborative Filtering
9. A Categorization of Major Clustering
Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
10. Partitioning Method
Partitioning Method: It is used to make partitions of the data in order to form
clusters. If “n” partitions are made on “p” objects of the database, then each partition
is represented by a cluster and n < p. Two conditions must be satisfied by this
partitioning clustering method:
Each object must belong to exactly one group.
No group may be empty: each group must contain at least one object.
In the partitioning method there is a technique called iterative relocation, which
means an object may be moved from one group to another to improve the
partitioning.
11. Partitioning Method
This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. It is up to the data analyst to specify the
number of clusters to be generated. Given a database D containing N objects, the
partitioning method constructs K user-specified partitions of the data, in which each
partition represents a cluster and a particular region. Many algorithms come under
the partitioning method; some of the popular ones are K-means, PAM (K-medoids),
and CLARA (Clustering Large Applications).
12. Partitioning Method
Algorithm: K-means
Input:
K: the number of clusters into which the dataset is to be divided
D: a dataset containing N objects
Output:
A set of K clusters
Method:
• Randomly select K objects from the dataset (D) as the initial cluster centres (C).
• (Re)assign each object to the cluster whose centre it is most similar to, based on
the mean values of the clusters.
• Update the cluster means, i.e., recalculate the mean of each cluster with the
updated assignments.
• Repeat from Step 2 until no change occurs.
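The four steps above can be sketched in Python. This is a minimal 2-D version; the use of squared Euclidean distance as the similarity measure and the empty-cluster guard are illustrative choices:

```python
import random

def k_means(points, k, iters=100):
    """Minimal K-means on 2-D points, following the four steps above."""
    # Step 1: randomly select K objects as the initial cluster centres.
    centres = random.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each object to its most similar centre
        # (smallest squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda j: (p[0] - centres[j][0]) ** 2 +
                                    (p[1] - centres[j][1]) ** 2)
            clusters[idx].append(p)
        # Step 3: recalculate each cluster mean (keep old centre if empty).
        new_centres = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centres[j]
            for j, cl in enumerate(clusters)]
        # Step 4: repeat until no change occurs.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

centres, clusters = k_means([(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)], 2)
# Two well-separated blobs of three points each end up in separate clusters.
```

Because the initial centres are chosen randomly, different runs can converge to different partitions on less clearly separated data.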
13. Hierarchical Method
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data
objects is created. Hierarchical methods are classified on the basis of how the hierarchical
decomposition is formed. There are two types of approaches for the creation of the
hierarchical decomposition:
1. Agglomerative Approach
2. Divisive Approach
14. Hierarchical Method
1. Agglomerative Approach: The agglomerative approach is also known as the bottom-up
approach. Initially, each object forms its own separate group. The method then keeps
merging the objects or groups that are close to one another, i.e., that exhibit similar
properties. This merging continues until a termination condition holds.
2. Divisive Approach: The divisive approach is also known as the top-down approach. In this
approach, we start with all data objects in a single cluster. This cluster is then divided into
smaller clusters by continuous iteration. The iteration continues until a termination
condition is met or until each cluster contains only one object.
15. Hierarchical Method
Once a group is split or merged, the step can never be undone: hierarchical clustering is a
rigid, inflexible method. Two approaches can be used to improve the quality of hierarchical
clustering in data mining:
Carefully analyze the linkages between objects at every partitioning of the hierarchical
clustering.
Integrate hierarchical agglomeration with other clustering techniques (as in BIRCH, for
example). In this approach, the objects are first grouped into micro-clusters; macro-clustering
is then performed on the micro-clusters.
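The agglomerative (bottom-up) merging described above can be sketched on 1-D data. Single linkage, i.e. measuring the distance between two groups by their closest pair of members, is one possible choice of closeness here:

```python
def single_linkage(points, num_clusters):
    """Bottom-up (agglomerative) clustering sketch: start with every point in
    its own cluster, then repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Single linkage: cluster distance = closest pair of members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

print(single_linkage([1, 2, 3, 10, 11, 25], 3))
# The nearby values merge first; 25 remains a singleton cluster.
```

Stopping at a requested number of clusters stands in for the "termination condition" in the slide; running until one cluster remains would yield the full hierarchy.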
16. Density-Based Method
Density-Based Method: The density-based method mainly focuses on density. In this
method, a given cluster keeps growing as long as the density in its neighbourhood exceeds
some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given
radius has to contain at least a minimum number of points.
Density-Based Clustering Methods:
• DBSCAN,
• OPTICS,
• DENCLUE
17. Density-Based Method
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It
depends on a density-based notion of a cluster and identifies clusters of arbitrary shape in
spatial databases with noise.
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an
ordering of the database with respect to its density-based clustering structure. The cluster
ordering contains information equivalent to the density-based clusterings obtained from a
wide range of parameter settings. OPTICS is useful for both automatic and interactive
cluster analysis, including finding an intrinsic clustering structure.
DENCLUE (DENsity-based CLUstEring) was proposed by Hinneburg and Keim. It enables
a compact mathematical description of arbitrarily shaped clusters in high-dimensional data,
and it is good for data sets with large amounts of noise.
18. Density-Based Method
Major Features of Density-Based Clustering
The primary features of Density-based clustering are given below.
• It is a one-scan method: the database needs to be examined only once.
• It requires density parameters as a termination condition.
• It can handle noise in the data.
• It can identify clusters of arbitrary shape.
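These features can be seen in a minimal DBSCAN-style sketch on 1-D points. `eps` and `min_pts` are the density parameters; the brute-force neighbourhood search is for clarity, not efficiency:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a cluster grows while each point's
    eps-neighbourhood holds at least min_pts points; sparse points are noise."""
    labels = {}          # point index -> cluster id, or -1 for noise
    cluster_id = 0

    def neighbours(i):
        # Brute-force eps-neighbourhood (includes the point itself).
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1   # noise (may later be absorbed by a cluster)
            continue
        labels[i] = cluster_id            # i is a core point: start a cluster
        queue = [j for j in nbrs if j != i]
        while queue:                       # grow the cluster through density
            j = queue.pop()
            if j not in labels or labels[j] == -1:
                labels[j] = cluster_id
                jn = neighbours(j)
                if len(jn) >= min_pts:     # j is also core: expand further
                    queue.extend(q for q in jn if q not in labels)
        cluster_id += 1
    return labels

labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0], eps=0.5, min_pts=2)
# The two dense runs form clusters; the isolated 20.0 is labelled noise (-1).
```

The number of clusters is not specified in advance; it emerges from the density parameters, which is exactly what distinguishes this family from partitioning methods.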
19. Grid-Based Method
Grid-Based Method: In the grid-based method, a grid is formed over the object space, i.e.,
the object space is quantized into a finite number of cells that form a grid structure. A major
advantage of the grid-based method is its fast processing time, which depends only on the
number of cells in each dimension of the quantized space rather than on the number of data
objects.
Instances of the grid-based approach include STING, which explores statistical information
stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform
approach; and CLIQUE, which defines a combined grid- and density-based approach for
clustering in high-dimensional data space.
20. Grid-Based Method
• STING is a grid-based multiresolution clustering method in which the spatial area is
divided into rectangular cells.
• There are generally several levels of such rectangular cells, corresponding to multiple
levels of resolution; these cells form a hierarchical structure in which each cell at a higher
level is partitioned to form several cells at the next lower level.
• Statistical information about the attributes in each grid cell (including the mean, maximum,
and minimum values) is precomputed and stored.
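The cell-quantization idea can be sketched as follows. This is a simplified single-resolution version, not the full multiresolution STING hierarchy; `cell_size` and `density_threshold` are illustrative parameters:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    """Grid-based sketch: quantize the 2-D space into cells, keep the cells
    holding at least density_threshold points, and join adjacent dense cells."""
    # Step 1: quantize each point into its grid cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # Step 2: keep only the dense cells.
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}
    # Step 3: flood-fill adjacent dense cells into clusters.
    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        group, stack = [], [cell]
        seen.add(cell)
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(group)
    return clusters

clusters = grid_cluster(
    [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1),
     (5.1, 5.1), (5.2, 5.2), (5.3, 5.3), (9.5, 0.5)],
    cell_size=1.0, density_threshold=2)
# Two dense cells become two clusters; the lone stray point is dropped.
```

After step 1, the work is proportional to the number of cells rather than the number of points, which is the fast-processing-time property described above.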
21. Grid-Based Method
Several interesting methods
• STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) - A multi-resolution
clustering approach using wavelet method
• CLIQUE - Agrawal, et al. (SIGMOD’98)
22. Model-Based Method
Model-Based Method: In the model-based method, a model is hypothesized for each cluster
in order to find the data that best fits that model. A density function is used to locate the
clusters for a given model. This reflects the spatial distribution of the data points and also
provides a way to automatically determine the number of clusters based on standard
statistics, taking outliers or noise into account; it therefore yields robust clustering methods.
Model-based clustering attempts to optimize the fit between the given data and some
mathematical model, and is based on the assumption that the data are generated by a
mixture of underlying probability distributions.
23. Model-Based Method
Model-based clustering includes the following approaches:
Statistical approach − Expectation maximization (EM) is a popular iterative refinement
algorithm. It can be viewed as an extension of k-means:
It assigns each object to a cluster according to a weight (a probability of membership).
New means are computed based on these weighted assignments.
The basic idea is as follows:
• Start with an initial estimate of the parameter vector.
• Iteratively rescore the patterns against the mixture density produced by the parameter
vector.
• Use the rescored patterns to update the parameter estimates.
• Patterns belonging to the same cluster are those placed by their scores in the same
component.
24. Model-Based Method
Machine learning approach − Machine learning builds complex algorithms for processing
huge amounts of data and delivering results to its users. It uses programs that can learn
through experience and make predictions.
These algorithms improve themselves through repeated input of training data. The main
objective of machine learning is to learn from data and build models that can be understood
and used by humans.
A well-known example is incremental conceptual learning (e.g., COBWEB), which
produces a hierarchical clustering in the form of a classification tree. Each node represents a
concept and contains a probabilistic description of that concept.
Limitations:
• The assumption that the attributes are independent of each other is often too strong,
because correlations can exist.
• It is not suitable for clustering large databases: the tree may become skewed, and
maintaining the probability distributions is expensive.
25. Model-Based Method
Neural Network Approach − The neural network approach represents each cluster as an
exemplar, which acts as a prototype of the cluster. New objects are assigned to the cluster
whose exemplar is the most similar, according to some distance measure.
Since the quality of clustering depends not only on the distribution of the data points but
also on the learned representation, deep neural networks can be effective in transforming
high-dimensional data into a lower-dimensional feature space, leading to improved
clustering results.
26. Constraint-Based Method
Constraint-Based Method: Constraint-based clustering is performed by incorporating
application- or user-oriented constraints. A constraint refers to the user's expectations or to
properties of the desired clustering results. Constraints give us an interactive way of
communicating with the clustering process; they may be specified by the user or by the
application's requirements.
27. Constraint-Based Method
Methods for Clustering with Constraints:
There are various methods for clustering with constraints, each handling a specific kind of
constraint:
Handling hard constraints: Hard constraints can be handled by respecting the constraints
during the cluster-assignment procedure: an object may only be assigned to a cluster that
keeps every hard constraint satisfied.
Generating super-instances for must-link constraints: The transitive closure of the
must-link constraints is computed, so the must-link relation becomes an equivalence relation
that partitions the objects into subsets. Each such subset of objects can then be replaced by
its mean, forming a single "super-instance".
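The super-instance construction can be sketched with a union-find structure over the must-link constraints (1-D objects are used for brevity):

```python
def super_instances(points, must_links):
    """Take the transitive closure of must-link constraints (union-find)
    and replace each resulting group of objects by its mean."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parents to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_links:          # union the endpoints of each constraint
        parent[find(a)] = find(b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    # Each super-instance is the mean of its group of original objects.
    return [sum(g) / len(g) for g in groups.values()]

print(super_instances([1.0, 2.0, 3.0, 10.0], [(0, 1), (1, 2)]))
# The chain (0,1),(1,2) closes transitively into the group {1.0, 2.0, 3.0}.
```

Note how the two pairwise constraints (0, 1) and (1, 2) force objects 0, 1, and 2 into one group even though 0 and 2 were never directly linked: this is the equivalence relation described above.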
28. Constraint-Based Method
Handling soft constraints: Clustering with soft constraints is always an optimization
process: a penalty is charged for each violated constraint, and the aim is to jointly minimize
the constraint-violation penalty and optimize the clustering quality. For example, given a
data set and a set of constraints, the CVQE (Constrained Vector Quantization Error)
algorithm applies K-means clustering with a constraint-violation penalty added to its
objective.
The CVQE objective is the total distance used in K-means, extended with:
• a penalty for each must-link violation, and
• a penalty for each cannot-link violation.
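The soft-constraint idea can be sketched as a penalized assignment step. This is a simplified illustration of the CVQE idea, not the exact CVQE objective; the flat `penalty` value and the greedy point-by-point assignment order are assumptions of the sketch:

```python
def penalized_assign(points, centres, must_links, cannot_links, penalty=100.0):
    """Soft-constraint sketch: assign each 1-D point to the centre that
    minimizes squared distance plus a penalty for violated constraints."""
    assign = {}
    for i, x in enumerate(points):
        costs = []
        for j, c in enumerate(centres):
            cost = (x - c) ** 2
            for a, b in must_links:        # must-link partner in another cluster
                other = b if a == i else a if b == i else None
                if other in assign and assign[other] != j:
                    cost += penalty
            for a, b in cannot_links:      # cannot-link partner in same cluster
                other = b if a == i else a if b == i else None
                if other in assign and assign[other] == j:
                    cost += penalty
            costs.append(cost)
        assign[i] = min(range(len(centres)), key=lambda j: costs[j])
    return assign

# Without constraints, points 0.0 and 1.0 both join the centre at 0.0;
# a cannot-link between them pushes 1.0 to the other centre instead.
a = penalized_assign([0.0, 1.0, 10.0], [0.0, 10.0], [], [(0, 1)])
```

Because the penalty is finite, a constraint can still be violated when satisfying it would be even more costly, which is exactly what makes these constraints "soft".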
29. There are several categories of constraints which
are as follows −
Constraints on individual objects
Constraints on the selection of clustering parameters
Constraints on distance or similarity functions
User-specified constraints on the properties of individual clusters
Semi-supervised clustering based on “partial” supervision
30. Outlier Analysis
“Outliers" refer to the data points that exist outside of what is to be expected. The major thing
about the outliers is what you do with them. If you are going to analyze any task to analyze data
sets, you will always have some assumptions based on how this data is generated.
Difference between outliers and noise
Any unwanted error occurs in some previously measured variable, or there is any variance in
the previously measured variable called noise. Before finding the outliers present in any data
set, it is recommended first to remove the noise.
Types of Outliers: Outliers are divided into three different types
1. Global or point outliers
2. Collective outliers
3. Contextual or conditional outliers
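A global (point) outlier, for example, can be flagged with a simple statistical test. In this minimal z-score sketch, the threshold of 2 standard deviations is an illustrative choice, not a universal rule:

```python
def global_outliers(values, threshold=2.0):
    """Flag global (point) outliers: values whose z-score, i.e. distance from
    the mean measured in standard deviations, exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

print(global_outliers([10, 11, 9, 10, 12, 11, 10, 50]))
# The value 50 stands far outside the bulk of the data and is flagged.
```

Contextual and collective outliers need more than a single global statistic, since the same value may be normal in one context and anomalous in another.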
31. Outlier Analysis
Outliers are discarded at many places when data mining is applied. But it is still used in many
applications like fraud detection, medical, etc. It is usually because the events that occur rarely
can store much more significant information than the events that occur more regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through outlier
analysis in data mining.
• Fraud detection in the telecom industry
• In market analysis, outlier analysis enables marketers to identify the customer's behaviors.
• In the Medical analysis field.
• Fraud detection in banking and finance such as credit cards, insurance sector, etc.
The process in which the behavior of the outliers is identified in a dataset is called outlier
analysis. It is also known as "outlier mining", the process is defined as a significant task of data
mining.
32. Applications Of Cluster Analysis
It is widely used in applications such as image processing, data analysis, market research,
and pattern recognition.
It helps marketers find distinct groups in their customer base and characterize those
groups by their purchasing patterns.
It can be used in the field of biology to derive plant and animal taxonomies and to identify
genes with similar capabilities.
It also helps in information discovery by classifying documents on the web.
33. Applications Of Cluster Analysis
It helps in classifying documents on the internet for information discovery.
Clustering is also used in detection applications, such as the detection of credit-card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data and to analyze the characteristics of each cluster.
In terms of biology, it can be used to determine plant and animal taxonomies, to categorize
genes with the same functionalities, and to gain insight into structures inherent in
populations.
It helps in identifying areas of similar land use in an earth-observation database, and in
identifying groups of houses in a city according to house type, value, and geographical
location.
34. Advantages of Cluster Analysis
It can help identify patterns and relationships within a dataset that may not be immediately
obvious.
It can be used for exploratory data analysis and can help with feature selection.
It can be used to reduce the dimensionality of the data.
It can be used for anomaly detection and outlier identification.
It can be used for market segmentation and customer profiling.
35. Disadvantages of Cluster Analysis
It can be sensitive to the choice of initial conditions and the number of clusters.
It can be sensitive to the presence of noise or outliers in the data.
It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
It can be computationally expensive for large datasets.
The results of the analysis can be affected by the choice of clustering algorithm used.
It is important to note that the success of cluster analysis depends on the data, the goals of
the analysis, and the ability of the analyst to interpret the results.
37. What is a Cluster?
• A cluster is a subset of similar objects
• A subset of objects such that the distance between any two objects in the cluster is less
than the distance between any object in the cluster and any object not located inside it.
• A connected region of a multidimensional space with a comparatively high density
of objects.
38. What is clustering in Data Mining?
• Clustering is the method of converting a group of abstract objects into classes of
similar objects.
• Clustering is a method of partitioning a set of data or objects into a set of
significant subclasses called clusters.
• It helps users understand the structure or natural groupings in a data set, and is used
either as a stand-alone tool to gain insight into the data distribution or as a
pre-processing step for other algorithms.
39. Important points
• The data objects of a cluster can be treated as one group.
• In cluster analysis, we first partition the data set into groups based on data similarity,
and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
40. Why is clustering used in data mining?
• Clustering analysis has been an evolving problem in data mining due to its variety
of applications.
• The advent of various data-clustering tools in the last few years, and their
comprehensive use in a broad range of applications including image processing,
computational biology, mobile communication, medicine, and economics, have
contributed to the popularity of these algorithms.
• The main issue with data-clustering algorithms is that they cannot be standardized.
41. Why is clustering used in data mining?
• An advanced algorithm may give the best results with one type of data set but may fail
or perform poorly with other kinds of data set.
• Although many efforts have been made to standardize algorithms that perform well in
all situations, no significant success has been achieved so far.
• Many clustering tools have been proposed, but each algorithm has its own advantages
and disadvantages and cannot work in all real situations.