SlideShare a Scribd company logo
1 of 41
DATA WAREHOUSING AND
DATA MINING
UNIT – 4
Prepared by
Mr. P. Nandakumar
Assistant Professor,
Department of IT, SVCET
UNIT-IV
Cluster Analysis: Types of Data in Cluster Analysis – A Categorization of
Major Clustering Methods – Partitioning Methods – Hierarchical methods
– Density, Based Methods – Grid, Based Methods – Model, Based
Clustering Methods – Clustering High, Dimensional Data – Constraint,
Based Cluster Analysis – Outlier Analysis.
Cluster Analysis: Types of Data in
Cluster Analysis
• Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together.
• The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points
within each group are more similar to each other than to data points in other groups.
• This process is often used for exploratory data analysis and can help identify patterns or relationships
within the data that may not be immediately obvious.
• There are many different algorithms used for cluster analysis, such as k-means, hierarchical
clustering, and density-based clustering.
• The choice of algorithm will depend on the specific requirements of the analysis and the nature of the
data being analyzed.
Cluster Analysis: Types of Data in
Cluster Analysis
• Cluster Analysis is the process to find similar groups of objects in order to form clusters. It is an
unsupervised machine learning-based algorithm that acts on unlabelled data. A group of data points
would comprise together to form a cluster in which all the objects would belong to the same group.
• The given data is divided into different groups by combining similar objects into a group. This group
is nothing but a cluster. A cluster is nothing but a collection of similar data which is grouped together.
• For example, consider a dataset of vehicles given in which it contains information about different
vehicles like cars, buses, bicycles, etc. As it is unsupervised learning there are no class labels like
Cars, Bikes, etc for all the vehicles, all the data is combined and is not in a structured manner.
Properties of Clustering
Clustering Scalability
High Dimensionality
Algorithm Usability with multiple data kinds
Dealing with unstructured data
Interpretability
Types of Cluster Analysis
Broadly, there are 2 types of cluster analysis methods. On the basis of the categorization of data sets
into a particular cluster, cluster analysis can be divided into 2 types - hard and soft clustering. They are
as follows,
1. Hard Clustering
• In a given dataset, it is possible for a data researcher to organize clusters in a manner that a single
dataset is placed in only one of the total number of given clusters.
• This implies that a hard-core classification of datasets is required in order to organize and classify
data accordingly.
• For instance, a clustering algorithm classifies data points in one cluster such that they have the
maximum similarity. However, there are no other grounds of similarity with data sets belonging to
other clusters.
Types of Cluster Analysis
2. Soft Clustering
• The second class of cluster analysis is Soft Clustering. Unlike hard clustering that requires a given
data point to belong to only a cluster at a time, soft clustering follows a different rule.
• In the case of soft clustering, a given data point can belong to more than one cluster at a time. This
means that a fuzzy classification of datasets characterizes soft clustering.
• Fuzzy Clustering Algorithm in Machine learning is a renowned unsupervised algorithm for
processing data into soft clusters.
Applications of Cluster Analysis in
Real Time
Data Science
Marketing
Social Network Analysis
Image Segmentation
Collaborative Filtering
A Categorization of Major Clustering
Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Partitioning Method: It is used to make partitions on the data in order to form
clusters. If “n” partitions are done on “p” objects of the database then each partition
is represented by a cluster and n < p. The two conditions which need to be satisfied
with this Partitioning Clustering Method are:
One objective should only belong to only one group.
There should be no group without even a single purpose.
 In the partitioning method, there is one technique called iterative relocation, which
means the object will be moved from one group to another to improve the
partitioning
Partitioning Method
This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. Its the data analysts to specify the number of
clusters that has to be generated for the clustering methods. In the partitioning method
when database(D) that contains multiple(N) objects then the partitioning method
constructs user-specified(K) partitions of the data in which each partition represents a
cluster and a particular region. There are many algorithms that come under
partitioning method some of the popular ones are K-Mean, PAM(K-Medoids),
CLARA algorithm (Clustering Large Applications) etc.
Partitioning Method
Algorithm: K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
• Randomly assign K objects from the dataset(D) as cluster centres(C)
• (Re) Assign each object to which object is most similar based upon mean values.
• Update Cluster means, i.e., Recalculate the mean of each cluster with the updated
values.
• Repeat Step 2 until no change occurs.
Hierarchical Method
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data
objects is created. We can classify hierarchical methods and will be able to know the purpose
of classification on the basis of how the hierarchical decomposition is formed. There are two
types of approaches for the creation of hierarchical decomposition, they are:
1. Agglomerative Approach
2. Divisive Approach
Hierarchical Method
1. Agglomerative Approach: The agglomerative approach is also known as the bottom-up
approach. Initially, the given data is divided into which objects form separate groups.
Thereafter it keeps on merging the objects or the groups that are close to one another which
means that they exhibit similar properties. This merging process continues until the
termination condition holds.
2. Divisive Approach: The divisive approach is also known as the top-down approach. In this
approach, we would start with the data objects that are in the same cluster. The group of
individual clusters is divided into small clusters by continuous iteration. The iteration
continues until the condition of termination is met or until each cluster contains one object.
Hierarchical Method
Once the group is split or merged then it can never be undone as it is a rigid method and is not
so flexible. The two approaches which can be used to improve the Hierarchical Clustering
Quality in Data Mining are: –
One should carefully analyze the linkages of the object at every partitioning of hierarchical
clustering.
One can use a hierarchical agglomerative algorithm for the integration of hierarchical
agglomeration. In this approach, first, the objects are grouped into micro-clusters. After
grouping data objects into microclusters, macro clustering is performed on the microcluster.
Density-Based Method
Density-Based Method: The density-based method mainly focuses on density. In this
method, the given cluster will keep on growing continuously as long as the density in the
neighbourhood exceeds some threshold, i.e, for each data point within a given cluster. The
radius of a given cluster has to contain at least a minimum number of points.
Density-Based Clustering Methods:
• DBSCAN,
• OPTICS,
• DENCLUE
Density-Based Method
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It
depends on a density-based notion of cluster. It also identifies clusters of arbitrary size in the
spatial database with outliers.
OPTICS stands for Ordering Points To Identify the Clustering Structure. It gives a
significant order of database with respect to its density-based clustering structure. The order
of the cluster comprises information equivalent to the density-based clustering related to a
long range of parameter settings. OPTICS methods are beneficial for both automatic and
interactive cluster analysis, including determining an intrinsic clustering structure.
DENCLUE - Density-based clustering by Hinnebirg and Kiem. It enables a compact
mathematical description of arbitrarily shaped clusters in high dimension state of data, and it
is good for data sets with a huge amount of noise.
Density-Based Method
Major Features of Density-Based Clustering
The primary features of Density-based clustering are given below.
• It is a scan method.
• It requires density parameters as a termination condition.
• It is used to manage noise in data clusters.
• Density-based clustering is used to identify clusters of arbitrary size.
Grid-Based Method
Grid-Based Method: In the Grid-Based method a grid is formed using the object together,
i.e, the object space is quantized into a finite number of cells that form a grid structure. One of
the major advantages of the grid-based method is fast processing time and it is dependent only
on the number of cells in each dimension in the quantized space. The processing time for this
method is much faster so it can save time.
An instance of the grid-based approach involves STING, which explores statistical data stored
in the grid cells, WaveCluster, which clusters objects using a wavelet transform approach, and
CLIQUE, which defines a grid-and density-based approach for clustering in high-dimensional
data space.
Grid-Based Method
• STING is a grid-based multiresolution clustering method in which the spatial area is
divided into rectangular cells.
• There are generally several levels of such rectangular cells corresponding to multiple levels
of resolution, and these cells form a hierarchical mechanism each cell at a high level is
separation to form several cells at the next lower level.
• Statistical data regarding the attributes in each grid cell (including the mean, maximum, and
minimum values) is precomputed and stored.
Grid-Based Method
Several interesting methods
• STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) - A multi-resolution
clustering approach using wavelet method
• CLIQUE - Agrawal, et al. (SIGMOD’98)
Model-Based Method
Model-Based Method: In the model-based method, all the clusters are hypothesized in order
to find the data which is best suited for the model. The clustering of the density function is
used to locate the clusters for a given model. It reflects the spatial distribution of data points
and also provides a way to automatically determine the number of clusters based on standard
statistics, taking outlier or noise into account. Therefore it yields robust clustering methods.
Model-based clustering is a try to advance the fit between the given data and some
mathematical model and is based on the assumption that data are created by a combination of
a basic probability distribution.
Model-Based Method
There are the following types of model-based clustering are as follows −
Statistical approach − Expectation maximization is a popular iterative refinement algorithm.
An extension to k-means −
It can assign each object to a cluster according to weight (probability distribution).
New means are computed based on weight measures.
The basic idea is as follows −
• It can start with an initial estimate of the parameter vector.
• It can be used to iteratively rescore the designs against the mixture density made by the
parameter vector.
• It is used to rescored patterns are used to update the parameter estimates.
• It can be used to pattern belonging to the same cluster if they are placed by their scores in a
particular component.
Model-Based Method
Machine learning approach − Machine learning is an approach that makes complex
algorithms for huge data processing and supports results to its users. It uses complex
programs that can understand through experience and create predictions.
The algorithms are improved by themselves by frequent input of training information. The
main objective of machine learning is to learn data and build models from data that can be
understood and used by humans.
It is a famous approach of incremental conceptual learning, which produces a hierarchical
clustering in the form of a classification tree. Each node defines a concept and includes a
probabilistic representation of that concept.
Limitations:
• The assumption that the attributes are independent of each other is often too strong because
correlation can exist.
• It is not suitable for clustering large database data, skewed trees, and expensive probability
distributions.
Model-Based Method
Neural Network Approach − The neural network approach represents each cluster as an
example, acting as a prototype of the cluster. The new objects are distributed to the cluster
whose example is the most similar according to some distance measure.
Since the quality of clustering is not only dependent on the distribution of data points but also
on the learned representation, deep neural networks can be effective means to transform
mappings from a high-dimensional data space into a lower-dimensional feature space, leading
to improved clustering results.
Constraint-Based Method
Constraint-Based Method: The constraint-based clustering method is performed by the
incorporation of application or user-oriented constraints. A constraint refers to the user
expectation or the properties of the desired clustering results. Constraints provide us with an
interactive way of communication with the clustering process. The user or the application
requirement can specify constraints.
Constraint-Based Method
Methods For Clustering With Constraints:
There are various methods for clustering with constraints and can handle specific constraints:
Handling Hard Constraints: There is a method for handling the hard constraints by
regarding the constraint in a cluster assignment procedure. It is a very important method for
handling the difficult constraints we can regard the constraints in the assignment procedure of
cluster.
Generating the super instances for must-link constraints: There are must-link constraints
that have transitive closure that can be calculated by it. so that we can say that must-link
constraints are known as an equivalence relation. The subset can be defined by it. In subset,
there are some objects which can be replaced by the mean.
Constraint-Based Method
Handling the soft constraints: In the clustering process of soft constraints, there is always an
optimization process. There is always a penalty requires in the clustering process. Hence the
optimization in this process’s aim is to optimize the constraint violation and decreasing the
clustering aspect. For example, if we take two sets one is of data sets and the other is a set of
constraints, CVQE stands for Constrained Vector Quantization Error. In the CVQE algorithm,
K-means clustering is enforced to constraint violation penalty.
The main objective of CVQE is the total distance used for K-means which are used as
follows:
• Penalty in must-link violation
• Penalty in cannot-link violation:
There are several categories of constraints which
are as follows −
Constraints on individual objects
Constraints on the selection of clustering parameters
Constraints on distance or similarity functions
User-specified constraints on the properties of individual clusters
Semi-supervised clustering based on “partial” supervision
Outlier Analysis
“Outliers" refer to the data points that exist outside of what is to be expected. The major thing
about the outliers is what you do with them. If you are going to analyze any task to analyze data
sets, you will always have some assumptions based on how this data is generated.
Difference between outliers and noise
Any unwanted error occurs in some previously measured variable, or there is any variance in
the previously measured variable called noise. Before finding the outliers present in any data
set, it is recommended first to remove the noise.
Types of Outliers: Outliers are divided into three different types
1. Global or point outliers
2. Collective outliers
3. Contextual or conditional outliers
Outlier Analysis
Outliers are discarded at many places when data mining is applied. But it is still used in many
applications like fraud detection, medical, etc. It is usually because the events that occur rarely
can store much more significant information than the events that occur more regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through outlier
analysis in data mining.
• Fraud detection in the telecom industry
• In market analysis, outlier analysis enables marketers to identify the customer's behaviors.
• In the Medical analysis field.
• Fraud detection in banking and finance such as credit cards, insurance sector, etc.
The process in which the behavior of the outliers is identified in a dataset is called outlier
analysis. It is also known as "outlier mining", the process is defined as a significant task of data
mining.
Applications Of Cluster Analysis
It is widely used in image processing, data analysis, and pattern recognition.
It helps marketers to find the distinct groups in their customer base and they can characterize
their customer groups by using purchasing patterns.
It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
It also helps in information discovery by classifying documents on the web.
In many applications, clustering analysis is widely used, such as data analysis, market
research, pattern recognition, and image processing.
It assists marketers to find different groups in their client base and based on the purchasing
patterns. They can characterize their customer groups.
Applications Of Cluster Analysis
It helps in allocating documents on the internet for data discovery.
Clustering is also used in tracking applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to analyze the characteristics of each cluster.
In terms of biology, It can be used to determine plant and animal taxonomies, categorization
of genes with the same functionalities and gain insight into structure inherent to
populations.
It helps in the identification of areas of similar land that are used in an earth observation
database and the identification of house groups in a city according to house type, value, and
geographical location.
Advantages of Cluster Analysis
 It can help identify patterns and relationships within a dataset that may not be immediately
obvious.
 It can be used for exploratory data analysis and can help with feature selection.
 It can be used to reduce the dimensionality of the data.
 It can be used for anomaly detection and outlier identification.
 It can be used for market segmentation and customer profiling.
Disadvantages of Cluster Analysis
 It can be sensitive to the choice of initial conditions and the number of clusters.
 It can be sensitive to the presence of noise or outliers in the data.
 It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
 It can be computationally expensive for large datasets.
 The results of the analysis can be affected by the choice of clustering algorithm used.
 It is important to note that the success of cluster analysis depends on the data, the goals of
the analysis, and the ability of the analyst to interpret the results.
Additional Content on Cluster Analysis
What is a Cluster?
• A cluster is a subset of similar objects
• A subset of objects such that the distance between any of the two objects in the
cluster is less than the distance between any object in the cluster and any object
that is not located inside it.
• A connected region of a multidimensional space with a comparatively high density
of objects.
What is clustering in Data Mining?
• Clustering is the method of converting a group of abstract objects into classes of
similar objects.
• Clustering is a method of partitioning a set of data or objects into a set of
significant subclasses called clusters.
• It helps users to understand the structure or natural grouping in a data set and used
either as a stand-alone instrument to get a better insight into data distribution or as
a pre-processing step for other algorithms
Important points
• Data objects of a cluster can be considered as one group.
• We first partition the information set into groups while doing cluster analysis. It is
based on data similarities and then assigns the levels to the groups.
• The over-classification main advantage is that it is adaptable to modifications, and
it helps single out important characteristics that differentiate between distinct
groups.
Why is clustering used in data mining?
• Clustering analysis has been an evolving problem in data mining due to its variety
of applications.
• The advent of various data clustering tools in the last few years and their
comprehensive use in a broad range of applications, including image processing,
computational biology, mobile communication, medicine, and economics, must
contribute to the popularity of these algorithms.
• The main issue with the data clustering algorithms is that it cant be standardized.
Why is clustering used in data mining?
• The advanced algorithm may give the best results with one type of data set, but it
may fail or perform poorly with other kinds of data set.
• Although many efforts have been made to standardize the algorithms that can
perform well in all situations, no significant achievement has been achieved so far.
• Many clustering tools have been proposed so far. However, each algorithm has its
advantages or disadvantages and cant work on all real situations.

More Related Content

Similar to UNIT - 4: Data Warehousing and Data Mining

K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxSaiPragnaKancheti
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxSaiPragnaKancheti
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDatamining Tools
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
METHODS OF CLUSTER ANALYSIS.pptx
METHODS OF CLUSTER ANALYSIS.pptxMETHODS OF CLUSTER ANALYSIS.pptx
METHODS OF CLUSTER ANALYSIS.pptxagniva pradhan
 
1. METHODS OF CLUSTER ANALYSIS.pptx
1. METHODS OF CLUSTER ANALYSIS.pptx1. METHODS OF CLUSTER ANALYSIS.pptx
1. METHODS OF CLUSTER ANALYSIS.pptxagniva pradhan
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithmijsrd.com
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrievalBasma Gamal
 
pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)Pratik Meshram
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
Unit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptxUnit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptxDr.Shweta
 

Similar to UNIT - 4: Data Warehousing and Data Mining (20)

K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
 
K- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptxK- means clustering method based Data Mining of Network Shared Resources .pptx
K- means clustering method based Data Mining of Network Shared Resources .pptx
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
METHODS OF CLUSTER ANALYSIS.pptx
METHODS OF CLUSTER ANALYSIS.pptxMETHODS OF CLUSTER ANALYSIS.pptx
METHODS OF CLUSTER ANALYSIS.pptx
 
1. METHODS OF CLUSTER ANALYSIS.pptx
1. METHODS OF CLUSTER ANALYSIS.pptx1. METHODS OF CLUSTER ANALYSIS.pptx
1. METHODS OF CLUSTER ANALYSIS.pptx
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
Ir3116271633
Ir3116271633Ir3116271633
Ir3116271633
 
Data clustring
Data clustring Data clustring
Data clustring
 
DM_clustering.ppt
DM_clustering.pptDM_clustering.ppt
DM_clustering.ppt
 
47 292-298
47 292-29847 292-298
47 292-298
 
Du35687693
Du35687693Du35687693
Du35687693
 
F04463437
F04463437F04463437
F04463437
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
 
pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
Unit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptxUnit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptx
 

More from Nandakumar P

UNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningUNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningNandakumar P
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningNandakumar P
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
UNIT 2: Part 1: Data Warehousing and Data Mining
UNIT 2: Part 1: Data Warehousing and Data MiningUNIT 2: Part 1: Data Warehousing and Data Mining
UNIT 2: Part 1: Data Warehousing and Data MiningNandakumar P
 
UNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data MiningUNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data MiningNandakumar P
 
UNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningUNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningNandakumar P
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 
UNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 
UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON Nandakumar P
 
Python Course for Beginners
Python Course for BeginnersPython Course for Beginners
Python Course for BeginnersNandakumar P
 
CS6601-Unit 4 Distributed Systems
CS6601-Unit 4 Distributed SystemsCS6601-Unit 4 Distributed Systems
CS6601-Unit 4 Distributed SystemsNandakumar P
 
Unit-4 Professional Ethics in Engineering
Unit-4 Professional Ethics in EngineeringUnit-4 Professional Ethics in Engineering
Unit-4 Professional Ethics in EngineeringNandakumar P
 
Unit-3 Professional Ethics in Engineering
Unit-3 Professional Ethics in EngineeringUnit-3 Professional Ethics in Engineering
Unit-3 Professional Ethics in EngineeringNandakumar P
 
Naming in Distributed Systems
Naming in Distributed SystemsNaming in Distributed Systems
Naming in Distributed SystemsNandakumar P
 
Unit 3.1 cs6601 Distributed File System
Unit 3.1 cs6601 Distributed File SystemUnit 3.1 cs6601 Distributed File System
Unit 3.1 cs6601 Distributed File SystemNandakumar P
 
Unit 3 cs6601 Distributed Systems
Unit 3 cs6601 Distributed SystemsUnit 3 cs6601 Distributed Systems
Unit 3 cs6601 Distributed SystemsNandakumar P
 
Professional Ethics in Engineering
Professional Ethics in EngineeringProfessional Ethics in Engineering
Professional Ethics in EngineeringNandakumar P
 

More from Nandakumar P (17)

UNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningUNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data Mining
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data Mining
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
UNIT 2: Part 1: Data Warehousing and Data Mining
UNIT 2: Part 1: Data Warehousing and Data MiningUNIT 2: Part 1: Data Warehousing and Data Mining
UNIT 2: Part 1: Data Warehousing and Data Mining
 
UNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data MiningUNIT - 1 Part 2: Data Warehousing and Data Mining
UNIT - 1 Part 2: Data Warehousing and Data Mining
 
UNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningUNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data Mining
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
UNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 2 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT-1 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
Python Course for Beginners
Python Course for BeginnersPython Course for Beginners
Python Course for Beginners
 
CS6601-Unit 4 Distributed Systems
CS6601-Unit 4 Distributed SystemsCS6601-Unit 4 Distributed Systems
CS6601-Unit 4 Distributed Systems
 
Unit-4 Professional Ethics in Engineering
Unit-4 Professional Ethics in EngineeringUnit-4 Professional Ethics in Engineering
Unit-4 Professional Ethics in Engineering
 
Unit-3 Professional Ethics in Engineering
Unit-3 Professional Ethics in EngineeringUnit-3 Professional Ethics in Engineering
Unit-3 Professional Ethics in Engineering
 
Naming in Distributed Systems
Naming in Distributed SystemsNaming in Distributed Systems
Naming in Distributed Systems
 
Unit 3.1 cs6601 Distributed File System
Unit 3.1 cs6601 Distributed File SystemUnit 3.1 cs6601 Distributed File System
Unit 3.1 cs6601 Distributed File System
 
Unit 3 cs6601 Distributed Systems
Unit 3 cs6601 Distributed SystemsUnit 3 cs6601 Distributed Systems
Unit 3 cs6601 Distributed Systems
 
Professional Ethics in Engineering
Professional Ethics in EngineeringProfessional Ethics in Engineering
Professional Ethics in Engineering
 

Recently uploaded

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 

Recently uploaded (20)

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 

UNIT - 4: Data Warehousing and Data Mining

  • 1. DATA WAREHOUSING AND DATA MINING UNIT – 4 Prepared by Mr. P. Nandakumar Assistant Professor, Department of IT, SVCET
  • 2. UNIT-IV Cluster Analysis: Types of Data in Cluster Analysis – A Categorization of Major Clustering Methods – Partitioning Methods – Hierarchical methods – Density, Based Methods – Grid, Based Methods – Model, Based Clustering Methods – Clustering High, Dimensional Data – Constraint, Based Cluster Analysis – Outlier Analysis.
  • 3. Cluster Analysis: Types of Data in Cluster Analysis • Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. • The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups. • This process is often used for exploratory data analysis and can help identify patterns or relationships within the data that may not be immediately obvious. • There are many different algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based clustering. • The choice of algorithm will depend on the specific requirements of the analysis and the nature of the data being analyzed.
  • 4. Cluster Analysis: Types of Data in Cluster Analysis • Cluster Analysis is the process to find similar groups of objects in order to form clusters. It is an unsupervised machine learning-based algorithm that acts on unlabelled data. A group of data points would comprise together to form a cluster in which all the objects would belong to the same group. • The given data is divided into different groups by combining similar objects into a group. This group is nothing but a cluster. A cluster is nothing but a collection of similar data which is grouped together. • For example, consider a dataset of vehicles given in which it contains information about different vehicles like cars, buses, bicycles, etc. As it is unsupervised learning there are no class labels like Cars, Bikes, etc for all the vehicles, all the data is combined and is not in a structured manner.
  • 5. Properties of Clustering Clustering Scalability High Dimensionality Algorithm Usability with multiple data kinds Dealing with unstructured data Interpretability
  • 6. Types of Cluster Analysis Broadly, there are 2 types of cluster analysis methods. On the basis of the categorization of data sets into a particular cluster, cluster analysis can be divided into 2 types - hard and soft clustering. They are as follows, 1. Hard Clustering • In a given dataset, it is possible for a data researcher to organize clusters in a manner that a single dataset is placed in only one of the total number of given clusters. • This implies that a hard-core classification of datasets is required in order to organize and classify data accordingly. • For instance, a clustering algorithm classifies data points in one cluster such that they have the maximum similarity. However, there are no other grounds of similarity with data sets belonging to other clusters.
  • 7. Types of Cluster Analysis 2. Soft Clustering • The second class of cluster analysis is Soft Clustering. Unlike hard clustering that requires a given data point to belong to only a cluster at a time, soft clustering follows a different rule. • In the case of soft clustering, a given data point can belong to more than one cluster at a time. This means that a fuzzy classification of datasets characterizes soft clustering. • Fuzzy Clustering Algorithm in Machine learning is a renowned unsupervised algorithm for processing data into soft clusters.
  • 8. Applications of Cluster Analysis in Real Time Data Science Marketing Social Network Analysis Image Segmentation Collaborative Filtering
  • 9. A Categorization of Major Clustering Methods Partitioning Method Hierarchical Method Density-based Method Grid-Based Method Model-Based Method Constraint-based Method
  • 10. Partitioning Method Partitioning Method: It is used to make partitions on the data in order to form clusters. If “n” partitions are done on “p” objects of the database then each partition is represented by a cluster and n < p. The two conditions which need to be satisfied with this Partitioning Clustering Method are: One objective should only belong to only one group. There should be no group without even a single purpose.  In the partitioning method, there is one technique called iterative relocation, which means the object will be moved from one group to another to improve the partitioning
  • 11. Partitioning Method This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. Its the data analysts to specify the number of clusters that has to be generated for the clustering methods. In the partitioning method when database(D) that contains multiple(N) objects then the partitioning method constructs user-specified(K) partitions of the data in which each partition represents a cluster and a particular region. There are many algorithms that come under partitioning method some of the popular ones are K-Mean, PAM(K-Medoids), CLARA algorithm (Clustering Large Applications) etc.
  • 12. Partitioning Method Algorithm: K mean: Input: K: The number of clusters in which the dataset has to be divided D: A dataset containing N number of objects Output: A dataset of K clusters Method: • Randomly assign K objects from the dataset(D) as cluster centres(C) • (Re) Assign each object to which object is most similar based upon mean values. • Update Cluster means, i.e., Recalculate the mean of each cluster with the updated values. • Repeat Step 2 until no change occurs.
  • 13. Hierarchical Method Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created. We can classify hierarchical methods and will be able to know the purpose of classification on the basis of how the hierarchical decomposition is formed. There are two types of approaches for the creation of hierarchical decomposition, they are: 1. Agglomerative Approach 2. Divisive Approach
  • 14. Hierarchical Method 1. Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, the given data is divided into which objects form separate groups. Thereafter it keeps on merging the objects or the groups that are close to one another which means that they exhibit similar properties. This merging process continues until the termination condition holds. 2. Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we would start with the data objects that are in the same cluster. The group of individual clusters is divided into small clusters by continuous iteration. The iteration continues until the condition of termination is met or until each cluster contains one object.
  • 15. Hierarchical Method Once the group is split or merged then it can never be undone as it is a rigid method and is not so flexible. The two approaches which can be used to improve the Hierarchical Clustering Quality in Data Mining are: – One should carefully analyze the linkages of the object at every partitioning of hierarchical clustering. One can use a hierarchical agglomerative algorithm for the integration of hierarchical agglomeration. In this approach, first, the objects are grouped into micro-clusters. After grouping data objects into microclusters, macro clustering is performed on the microcluster.
  • 16. Density-Based Method Density-Based Method: The density-based method mainly focuses on density. In this method, the given cluster will keep on growing continuously as long as the density in the neighbourhood exceeds some threshold, i.e, for each data point within a given cluster. The radius of a given cluster has to contain at least a minimum number of points. Density-Based Clustering Methods: • DBSCAN, • OPTICS, • DENCLUE
  • 17. Density-Based Method DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It depends on a density-based notion of cluster. It also identifies clusters of arbitrary size in the spatial database with outliers. OPTICS stands for Ordering Points To Identify the Clustering Structure. It gives a significant order of database with respect to its density-based clustering structure. The order of the cluster comprises information equivalent to the density-based clustering related to a long range of parameter settings. OPTICS methods are beneficial for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure. DENCLUE - Density-based clustering by Hinnebirg and Kiem. It enables a compact mathematical description of arbitrarily shaped clusters in high dimension state of data, and it is good for data sets with a huge amount of noise.
  • 18. Density-Based Method Major Features of Density-Based Clustering The primary features of Density-based clustering are given below. • It is a scan method. • It requires density parameters as a termination condition. • It is used to manage noise in data clusters. • Density-based clustering is used to identify clusters of arbitrary size.
  • 19. Grid-Based Method Grid-Based Method: In the Grid-Based method a grid is formed using the object together, i.e, the object space is quantized into a finite number of cells that form a grid structure. One of the major advantages of the grid-based method is fast processing time and it is dependent only on the number of cells in each dimension in the quantized space. The processing time for this method is much faster so it can save time. An instance of the grid-based approach involves STING, which explores statistical data stored in the grid cells, WaveCluster, which clusters objects using a wavelet transform approach, and CLIQUE, which defines a grid-and density-based approach for clustering in high-dimensional data space.
  • 20. Grid-Based Method • STING is a grid-based multiresolution clustering method in which the spatial area is divided into rectangular cells. • There are generally several levels of such rectangular cells corresponding to multiple levels of resolution, and these cells form a hierarchical mechanism each cell at a high level is separation to form several cells at the next lower level. • Statistical data regarding the attributes in each grid cell (including the mean, maximum, and minimum values) is precomputed and stored.
  • 21. Grid-Based Method Several interesting methods • STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (1997) • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) - A multi-resolution clustering approach using wavelet method • CLIQUE - Agrawal, et al. (SIGMOD’98)
  • 22. Model-Based Method Model-Based Method: In the model-based method, all the clusters are hypothesized in order to find the data which is best suited for the model. The clustering of the density function is used to locate the clusters for a given model. It reflects the spatial distribution of data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outlier or noise into account. Therefore it yields robust clustering methods. Model-based clustering is a try to advance the fit between the given data and some mathematical model and is based on the assumption that data are created by a combination of a basic probability distribution.
  • 23. Model-Based Method There are the following types of model-based clustering are as follows − Statistical approach − Expectation maximization is a popular iterative refinement algorithm. An extension to k-means − It can assign each object to a cluster according to weight (probability distribution). New means are computed based on weight measures. The basic idea is as follows − • It can start with an initial estimate of the parameter vector. • It can be used to iteratively rescore the designs against the mixture density made by the parameter vector. • It is used to rescored patterns are used to update the parameter estimates. • It can be used to pattern belonging to the same cluster if they are placed by their scores in a particular component.
  • 24. Model-Based Method Machine learning approach − Machine learning is an approach that makes complex algorithms for huge data processing and supports results to its users. It uses complex programs that can understand through experience and create predictions. The algorithms are improved by themselves by frequent input of training information. The main objective of machine learning is to learn data and build models from data that can be understood and used by humans. It is a famous approach of incremental conceptual learning, which produces a hierarchical clustering in the form of a classification tree. Each node defines a concept and includes a probabilistic representation of that concept. Limitations: • The assumption that the attributes are independent of each other is often too strong because correlation can exist. • It is not suitable for clustering large database data, skewed trees, and expensive probability distributions.
  • 25. Model-Based Method Neural Network Approach − The neural network approach represents each cluster as an example, acting as a prototype of the cluster. The new objects are distributed to the cluster whose example is the most similar according to some distance measure. Since the quality of clustering is not only dependent on the distribution of data points but also on the learned representation, deep neural networks can be effective means to transform mappings from a high-dimensional data space into a lower-dimensional feature space, leading to improved clustering results.
  • 26. Constraint-Based Method Constraint-Based Method: The constraint-based clustering method is performed by the incorporation of application or user-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communication with the clustering process. The user or the application requirement can specify constraints.
  • 27. Constraint-Based Method Methods For Clustering With Constraints: There are various methods for clustering with constraints and can handle specific constraints: Handling Hard Constraints: There is a method for handling the hard constraints by regarding the constraint in a cluster assignment procedure. It is a very important method for handling the difficult constraints we can regard the constraints in the assignment procedure of cluster. Generating the super instances for must-link constraints: There are must-link constraints that have transitive closure that can be calculated by it. so that we can say that must-link constraints are known as an equivalence relation. The subset can be defined by it. In subset, there are some objects which can be replaced by the mean.
  • 28. Constraint-Based Method Handling the soft constraints: In the clustering process of soft constraints, there is always an optimization process. There is always a penalty requires in the clustering process. Hence the optimization in this process’s aim is to optimize the constraint violation and decreasing the clustering aspect. For example, if we take two sets one is of data sets and the other is a set of constraints, CVQE stands for Constrained Vector Quantization Error. In the CVQE algorithm, K-means clustering is enforced to constraint violation penalty. The main objective of CVQE is the total distance used for K-means which are used as follows: • Penalty in must-link violation • Penalty in cannot-link violation:
  • 29. There are several categories of constraints which are as follows − Constraints on individual objects Constraints on the selection of clustering parameters Constraints on distance or similarity functions User-specified constraints on the properties of individual clusters Semi-supervised clustering based on “partial” supervision
  • 30. Outlier Analysis “Outliers" refer to the data points that exist outside of what is to be expected. The major thing about the outliers is what you do with them. If you are going to analyze any task to analyze data sets, you will always have some assumptions based on how this data is generated. Difference between outliers and noise Any unwanted error occurs in some previously measured variable, or there is any variance in the previously measured variable called noise. Before finding the outliers present in any data set, it is recommended first to remove the noise. Types of Outliers: Outliers are divided into three different types 1. Global or point outliers 2. Collective outliers 3. Contextual or conditional outliers
  • 31. Outlier Analysis Outliers are discarded at many places when data mining is applied. But it is still used in many applications like fraud detection, medical, etc. It is usually because the events that occur rarely can store much more significant information than the events that occur more regularly. Other applications where outlier detection plays a vital role are given below. Any unusual response that occurs due to medical treatment can be analyzed through outlier analysis in data mining. • Fraud detection in the telecom industry • In market analysis, outlier analysis enables marketers to identify the customer's behaviors. • In the Medical analysis field. • Fraud detection in banking and finance such as credit cards, insurance sector, etc. The process in which the behavior of the outliers is identified in a dataset is called outlier analysis. It is also known as "outlier mining", the process is defined as a significant task of data mining.
  • 32. Applications Of Cluster Analysis It is widely used in image processing, data analysis, and pattern recognition. It helps marketers to find the distinct groups in their customer base and they can characterize their customer groups by using purchasing patterns. It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with the same capabilities. It also helps in information discovery by classifying documents on the web. In many applications, clustering analysis is widely used, such as data analysis, market research, pattern recognition, and image processing. It assists marketers to find different groups in their client base and based on the purchasing patterns. They can characterize their customer groups.
  • 33. Applications Of Cluster Analysis It helps in allocating documents on the internet for data discovery. Clustering is also used in tracking applications such as detection of credit card fraud. As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to analyze the characteristics of each cluster. In terms of biology, It can be used to determine plant and animal taxonomies, categorization of genes with the same functionalities and gain insight into structure inherent to populations. It helps in the identification of areas of similar land that are used in an earth observation database and the identification of house groups in a city according to house type, value, and geographical location.
  • 34. Advantages of Cluster Analysis  It can help identify patterns and relationships within a dataset that may not be immediately obvious.  It can be used for exploratory data analysis and can help with feature selection.  It can be used to reduce the dimensionality of the data.  It can be used for anomaly detection and outlier identification.  It can be used for market segmentation and customer profiling.
  • 35. Disadvantages of Cluster Analysis  It can be sensitive to the choice of initial conditions and the number of clusters.  It can be sensitive to the presence of noise or outliers in the data.  It can be difficult to interpret the results of the analysis if the clusters are not well-defined.  It can be computationally expensive for large datasets.  The results of the analysis can be affected by the choice of clustering algorithm used.  It is important to note that the success of cluster analysis depends on the data, the goals of the analysis, and the ability of the analyst to interpret the results.
  • 36. Additional Content on Cluster Analysis
  • 37. What is a Cluster? • A cluster is a subset of similar objects • A subset of objects such that the distance between any of the two objects in the cluster is less than the distance between any object in the cluster and any object that is not located inside it. • A connected region of a multidimensional space with a comparatively high density of objects.
  • 38. What is clustering in Data Mining? • Clustering is the method of converting a group of abstract objects into classes of similar objects. • Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters. • It helps users to understand the structure or natural grouping in a data set and used either as a stand-alone instrument to get a better insight into data distribution or as a pre-processing step for other algorithms
  • 39. Important points • Data objects of a cluster can be considered as one group. • We first partition the information set into groups while doing cluster analysis. It is based on data similarities and then assigns the levels to the groups. • The over-classification main advantage is that it is adaptable to modifications, and it helps single out important characteristics that differentiate between distinct groups.
  • 40. Why is clustering used in data mining? • Clustering analysis has been an evolving problem in data mining due to its variety of applications. • The advent of various data clustering tools in the last few years and their comprehensive use in a broad range of applications, including image processing, computational biology, mobile communication, medicine, and economics, must contribute to the popularity of these algorithms. • The main issue with the data clustering algorithms is that it cant be standardized.
  • 41. Why is clustering used in data mining? • The advanced algorithm may give the best results with one type of data set, but it may fail or perform poorly with other kinds of data set. • Although many efforts have been made to standardize the algorithms that can perform well in all situations, no significant achievement has been achieved so far. • Many clustering tools have been proposed so far. However, each algorithm has its advantages or disadvantages and cant work on all real situations.