Cluster analysis
• In cluster analysis we search for patterns in a data
set by grouping the (multivariate) observations
into clusters.
• The goal is to find an optimal grouping for which
the observations or objects within each cluster
are similar, but the clusters are dissimilar to each
other.
• We hope to find the natural groupings in the
data, groupings that make sense to the
researcher.
What is Cluster Analysis?
Cluster analysis . . . is a group of multivariate
techniques whose primary purpose is to group
objects based on the characteristics they possess.
• It has been referred to as Q analysis, typology
construction, classification analysis, and numerical
taxonomy.
• The essence of all clustering approaches is the
classification of data as suggested by “natural”
groupings of the data themselves.
Three-Cluster Diagram Showing Between-Cluster and Within-Cluster Variation
• Between-cluster variation = maximize
• Within-cluster variation = minimize
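The two quantities in this diagram can be made concrete with a small calculation. The sketch below is only an illustration: the data points, the cluster labels, and the helper names are made up, and sums of squared Euclidean distances are just one common way to quantify "variation".

```python
# Hypothetical 2-D observations, already assigned to three clusters.
clusters = {
    "A": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
    "C": [(1.0, 8.0), (2.0, 9.0), (1.0, 9.0)],
}

def centroid(points):
    """Mean of each coordinate."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

all_points = [p for pts in clusters.values() for p in pts]
grand_mean = centroid(all_points)

# Within-cluster variation: spread of observations around their own centroid.
within_ss = sum(sq_dist(p, centroid(pts)) for pts in clusters.values() for p in pts)

# Between-cluster variation: spread of the centroids around the grand mean,
# weighted by cluster size.
between_ss = sum(len(pts) * sq_dist(centroid(pts), grand_mean)
                 for pts in clusters.values())

total_ss = sum(sq_dist(p, grand_mean) for p in all_points)
print(within_ss, between_ss, total_ss)
```

With these definitions the total variation splits into the within and between parts, so for a fixed data set a grouping that lowers one necessarily raises the other.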
[Figure: Scatter Diagram for Cluster Observations, shown on four consecutive slides. Horizontal axis: frequency of going to fast food restaurants (low to high). Vertical axis: frequency of eating out (low to high).]
Criticisms of Cluster Analysis
The following must be addressed by
conceptual rather than empirical support:
• Cluster analysis is descriptive, atheoretical, and
noninferential.
• . . . will always create clusters, regardless of the
actual existence of any structure in the data.
• The cluster solution is not generalizable because it
is totally dependent upon the variables used as
the basis for the similarity measure.
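The second criticism, that clusters are always produced, can be seen on data that has no structure at all. The sketch below uses a tiny k-means-style procedure with two groups. That procedure is not described in these slides and stands in for any clustering method; the function name and the data are assumptions for illustration.

```python
def two_means_1d(values, iterations=20):
    """Very small 1-D k-means with k = 2 (Lloyd's algorithm), illustration only."""
    c1, c2 = min(values), max(values)   # start the two centers at the extremes
    for _ in range(iterations):
        # Assign every value to its nearer center.
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        # Move each center to the mean of its group.
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

# Evenly spaced values: by construction there are no gaps and no natural groups.
values = [float(v) for v in range(10)]
group1, group2 = two_means_1d(values)
print(group1)
print(group2)
```

The procedure still returns two tidy groups, which is why the existence of clusters has to be argued conceptually rather than read off the output.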
What Can We Do With
Cluster Analysis?
1. Determine if statistically different clusters
exist.
2. Identify the meaning of the clusters.
3. Explain how the clusters can be used.
Research Questions in Cluster Analysis
The primary objective of cluster analysis is to define the
structure of the data by placing the most similar
observations into groups. To do so, we must answer
three questions:
• How do we measure similarity?
• How do we form clusters?
• How many groups do we form?
Stage 1: Objectives of Cluster Analysis
Primary Goal = to partition a set of objects into two or
more groups based on the similarity of the objects for a
set of specified characteristics (the cluster variate).
Two key issues:
• The research questions being addressed, and
• The variables used to characterize objects in the
clustering process.
Other Research Questions?
Three basic questions . . .
• How to form the taxonomy – an empirically
based classification of objects.
• How to simplify the data – by grouping
observations for further analysis.
• Which relationships can be identified – the
process reveals relationships among the
observations.
Selecting Cluster Variables
Two Issues . . .
1. Conceptual considerations – include only variables
that . . .
– Characterize the objects being clustered
– Relate specifically to the objectives of the
cluster analysis
2. Practical considerations.
Rules of Thumb 9–1
OBJECTIVES OF CLUSTER ANALYSIS
• Cluster analysis is used for:
– Taxonomy description: identifying natural groups within the
data.
– Data simplification: the ability to analyze groups of similar
observations instead of all individual observations.
– Relationship identification: the simplified structure from cluster
analysis portrays relationships not revealed otherwise.
• Theoretical, conceptual and practical considerations must be
observed when selecting clustering variables for cluster analysis:
– Only variables that relate specifically to objectives of the cluster
analysis are included, since “irrelevant” variables cannot be
excluded from the analysis once it begins.
– Variables are selected which characterize the individuals
(objects) being clustered.
Stage 2: Research Design in Cluster Analysis
Four Questions . . .
• Is the sample size adequate?
• Can outliers be detected and, if so, should they be
deleted?
• How should object similarity be measured?
• Should the data be standardized?
Measuring Similarity
Interobject similarity is an empirical measure of
correspondence, or resemblance, between objects to be
clustered.
It can be measured in a variety of ways, but a
convenient measure of proximity is the distance
between two observations.
Since a distance increases as two units become further
apart, distance is actually a measure of dissimilarity.
Types of Distance Measures
• Euclidean distance
• Squared (or absolute) Euclidean distance
• City-block (Manhattan) distance
• Chebychev distance
• Mahalanobis distance (D²)
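The first four measures in this list can be written in a few lines each. The sketch below is a minimal illustration with made-up observations and function names. Mahalanobis distance is left out here because it also needs the covariance matrix of the variables.

```python
import math

def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Sum of squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0), (4.0, 6.0)   # hypothetical observations
print(euclidean(x, y))          # 5.0
print(squared_euclidean(x, y))  # 25.0
print(city_block(x, y))         # 7.0
print(chebyshev(x, y))          # 4.0
```

All four grow as the observations move apart, so each is a dissimilarity measure in the sense described above.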
Euclidean distance
Exercise
• Three items have the following bivariate
measurements (y1, y2): (2, 5), (4, 2), (7, 9).
• Make a proximity matrix of Euclidean
distances.
• What happens if the scale of y1 is multiplied by
100 (e.g. changing from m to cm)?
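One way to check an answer to this exercise is to compute both proximity matrices directly. The sketch below is a minimal version using Python's `math.dist`; the function and variable names are chosen for illustration.

```python
import math

def proximity_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

items = [(2, 5), (4, 2), (7, 9)]
original = proximity_matrix(items)

# Multiply y1 by 100 (a change of units) and recompute.
rescaled_items = [(100 * y1, y2) for y1, y2 in items]
rescaled = proximity_matrix(rescaled_items)

for row in original:
    print([round(d, 2) for d in row])
for row in rescaled:
    print([round(d, 2) for d in row])
```

On the original scale, item 3 is closer to item 1 (about 6.40) than to item 2 (about 7.62). After rescaling, differences in y1 dominate and item 3 is closer to item 2 (about 300.08) than to item 1 (about 500.02), so the ordering of the distances depends on the units.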
Exercise
• Determine Euclidean distance between
Atlanta and Boston.
Rules of Thumb 9–2
Research Design in Cluster Analysis
• The sample size required is not based on statistical
considerations for inference testing, but rather:
– Sufficient size is needed to ensure representativeness
of the population and its underlying structure,
particularly small groups within the population.
– Minimum group sizes are based on the relevance of
each group to the research question and the
confidence needed in characterizing that group.
Rules of Thumb 9–2 continued . . .
Research Design in Cluster Analysis
• Similarity measures calculated across the entire set of clustering variables
allow for the grouping of observations and their comparison to each other.
Distance measures are most often used as a measure of similarity, with
higher values representing greater dissimilarity (distance between cases),
not similarity.
– There are many different distance measures, including:
  – Euclidean (straight line) distance is the most common measure of
  distance.
  – Squared Euclidean distance is the sum of squared distances and is
  the recommended measure for the centroid and Ward’s methods of
  clustering.
  – Mahalanobis distance accounts for variable intercorrelations and
  weights each variable equally. When variables are highly
  intercorrelated, Mahalanobis distance is most appropriate.
– Less frequently used are correlational measures, where large values do
indicate similarity.
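To see what "accounts for variable intercorrelations" means in practice, the sketch below computes squared Mahalanobis distance for two variables, using the closed-form inverse of a 2x2 covariance matrix. The data and function names are hypothetical, and the code covers only the two-variable case.

```python
def covariance_2d(data):
    """Sample covariance terms (s11, s12, s22) for two variables."""
    n = len(data)
    m1 = sum(p[0] for p in data) / n
    m2 = sum(p[1] for p in data) / n
    s11 = sum((p[0] - m1) ** 2 for p in data) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in data) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in data) / (n - 1)
    return s11, s12, s22

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance D^2 = d' S^-1 d for two variables."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2          # must be non-zero
    d1, d2 = x[0] - y[0], x[1] - y[1]
    # Inverse of the 2x2 covariance matrix applied to the difference vector.
    return (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det

# Hypothetical data in which the two variables are strongly correlated.
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov = covariance_2d(data)

# Both pairs are the same Euclidean distance apart.
along = mahalanobis_sq((2.0, 2.0), (4.0, 4.0), cov)    # follows the correlation
across = mahalanobis_sq((2.0, 4.0), (4.0, 2.0), cov)   # cuts against it
print(along, across)
```

The pair that differs against the direction of the correlation gets a much larger D², even though Euclidean distance treats the two pairs as equally far apart.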
Rules of Thumb 9–2 continued . . .
Research Design in Cluster Analysis
• Given the sensitivity of some procedures to the similarity measure used, the
researcher should employ several distance measures and compare the
results from each with other results or theoretical/known patterns.
• Outliers can severely distort the representativeness of the results if they
appear as structure (clusters) that are inconsistent with the research
objectives.
– They should be removed if they represent:
  – Aberrant observations not representative of the population
  – Observations of small or insignificant segments within the
  population which are of no interest to the research objectives
– They should be retained if they represent an under-sampling/poor
representation of relevant groups in the population. In this case, the
sample should be augmented to ensure representation of these groups.
Rules of Thumb 9–2 continued . . .
Research Design in Cluster Analysis
• Outliers can be identified based on the similarity measure by:
– Finding observations with large distances from all other observations
– Graphic profile diagrams highlighting outlying cases
– Their appearance in cluster solutions as single-member or very small
clusters
• Clustering variables should be standardized whenever possible to avoid
problems resulting from the use of different scale values among clustering
variables.
– The most common standardization conversion is Z scores.
– If groups are to be identified according to an individual’s response style,
then within-case or row-centering standardization is appropriate.
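The two standardization options mentioned above can be sketched as follows. This is a minimal illustration using the standard library; the variables (income, satisfaction) and the ratings are made up, and the sample standard deviation is assumed for the Z scores.

```python
import statistics

def z_scores(column):
    """Standardize one variable to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)       # sample standard deviation
    return [(v - mean) / sd for v in column]

def row_center(row):
    """Within-case standardization: subtract each respondent's own mean."""
    mean = statistics.mean(row)
    return [v - mean for v in row]

# Hypothetical variables on very different scales.
income = [30000.0, 45000.0, 60000.0, 75000.0]
satisfaction = [2.0, 4.0, 5.0, 7.0]

income_z = z_scores(income)
satisfaction_z = z_scores(satisfaction)

# Hypothetical ratings from one respondent who tends to answer high.
centered = row_center([6.0, 7.0, 5.0, 6.0])
print(income_z, satisfaction_z, centered)
```

Z scores work column by column, so no variable dominates the distance just because of its units. Row centering works case by case and removes a respondent's overall tendency to rate high or low.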
Stage 3: Assumptions of Cluster Analysis
• Representativeness of the sample.
• Impact of multicollinearity.
Rules of Thumb 9–3
ASSUMPTIONS IN CLUSTER ANALYSIS
• Input variables should be examined for substantial
multicollinearity and if present . . .
– Reduce the variables to equal numbers in each set
of correlated measures.
– Use a distance measure that compensates for the
correlation, like Mahalanobis Distance.
– Take a proactive approach and include only cluster
variables that are not highly correlated.
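A simple way to examine input variables for multicollinearity is to scan the pairwise correlations. The sketch below is one possible check. The 0.9 cutoff, the variable names, and the data are assumptions for illustration and are not taken from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def highly_correlated_pairs(variables, threshold=0.9):
    """Return (name1, name2, r) for pairs whose |r| meets the threshold."""
    names = list(variables)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(variables[names[i]], variables[names[j]])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs

# Hypothetical clustering variables; v1 and v2 are near-duplicates.
variables = {
    "v1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "v2": [1.1, 2.1, 2.9, 4.2, 5.0],
    "v3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = highly_correlated_pairs(variables)
print(flagged)
```

Only the v1/v2 pair is flagged here. Keeping both would let the dimension they share count roughly twice in a Euclidean distance.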
Stage 4: Deriving Clusters and Assessing Overall Fit
The researcher must . . .
• Select the partitioning procedure used for
forming clusters
– Hierarchical
– Non-hierarchical
• Decide on the number of clusters to be
formed.
Two Types of Hierarchical
Clustering Procedures
1. Agglomerative Methods (buildup)
2. Divisive Methods (breakdown)
How Do Agglomerative Hierarchical Approaches Work?
• Start with all observations as their own cluster.
• Using the selected similarity measure, combine the
two most similar observations into a new cluster, now
containing two observations.
• Repeat the clustering procedure using the similarity
measure to combine the two most similar
observations or combinations of observations into
another new cluster.
• Continue the process until all observations are in a
single cluster.
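The four steps above can be sketched as a short loop. This is a naive illustration, not an efficient implementation. It assumes Euclidean distance and uses nearest-neighbor (single linkage) distance between clusters as the default similarity measure; the points and function names are made up.

```python
import math

def single_linkage(c1, c2):
    """Distance between clusters = distance between their closest members."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, linkage=single_linkage):
    """Naive agglomerative clustering; returns the merge history."""
    clusters = [[p] for p in points]        # every observation starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters that is most similar (smallest distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((d, merged))
        # Replace the two old clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]   # hypothetical
history = agglomerate(points)
for distance, members in history:
    print(round(distance, 2), members)
```

With n observations the loop always performs n - 1 merges, ending with a single cluster that contains everything. A different `linkage` function can be passed in to change how cluster-to-cluster distance is defined.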
Agglomerative Algorithms
• Single Linkage (nearest neighbor)
• Complete Linkage (farthest neighbor)
• Average Linkage
• Centroid Method
• Median Method
• Ward’s Method
[Figures: one slide each illustrating single linkage (nearest neighbor), complete linkage (farthest neighbor), average linkage, the centroid method, and the median method]
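The algorithms differ mainly in how they define the distance between two clusters. The sketch below writes out that definition for five of them, assuming Euclidean distance between observations. The clusters and function names are hypothetical. The median method is omitted because it depends on how earlier merges positioned each cluster's center, so it cannot be computed from the members alone.

```python
import math

def centroid(cluster):
    """Mean of each coordinate over the cluster's members."""
    n = len(cluster)
    return tuple(sum(p[j] for p in cluster) / n for j in range(len(cluster[0])))

def single_linkage(a, b):
    """Nearest neighbor: smallest pairwise distance."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Farthest neighbor: largest pairwise distance."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b were merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2

a = [(0.0, 0.0), (0.0, 2.0)]    # hypothetical clusters
b = [(3.0, 0.0), (5.0, 0.0)]
for rule in (single_linkage, complete_linkage, average_linkage,
             centroid_distance, ward_increase):
    print(rule.__name__, round(rule(a, b), 3))
```

At each step an agglomerative procedure merges the pair of clusters with the smallest value under whichever rule was chosen, so the same data can yield different hierarchies under different rules.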
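[Figure: dendrogram of six cities (Den, Det, Dal, Bos, Chi, Atl). Values and cluster memberships recovered from the slide:]
Cluster   Value   Members
C1        358.7   Den-Det
C2        447.4   Bos-Chi
C3        464.5   Den-Det-Dal
C4        516.4   Den-Det-Dal-Atl
C5        590.2   Den-Det-Dal-Atl-Bos-Chi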
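If the values C1 to C5 above are read as the distances at which each merge occurred (an assumption, since the slide gives no column labels), the gaps between successive merges can be computed directly. Looking for a large gap is one informal heuristic for the question of how many clusters to form. It is not stated in these slides and is shown here only as a sketch.

```python
# Values and memberships as listed above; treating the values as merge
# distances is an assumption.
schedule = [
    ("C1", 358.7, "Den-Det"),
    ("C2", 447.4, "Bos-Chi"),
    ("C3", 464.5, "Den-Det-Dal"),
    ("C4", 516.4, "Den-Det-Dal-Atl"),
    ("C5", 590.2, "Den-Det-Dal-Atl-Bos-Chi"),
]

previous = None
increases = []
for step, distance, members in schedule:
    if previous is not None:
        # Gap between this merge and the one before it.
        increases.append((step, round(distance - previous, 1)))
    previous = distance

for step, gap in increases:
    print(step, gap)
```

A comparatively large gap suggests that the merge joined clusters that were relatively far apart, which some analysts take as a reason to stop before that step.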

01 Statistika Lanjut - Cluster Analysis part 1 with sound (1).pptx
