Machine Learning Clustering Problems Workflow

Business Task

Data Pre-processing

1. Feature Selection
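
One simple form of feature selection is to drop columns that barely vary, since they contribute almost nothing to distances between points. The sketch below is only an illustration of that idea; the function name `select_by_variance` and the threshold value are assumptions, not part of the original workflow.

```python
from statistics import pvariance

def select_by_variance(rows, threshold=1e-3):
    """Keep only columns whose population variance exceeds the threshold.

    The threshold is a hypothetical default and depends on feature scale.
    """
    columns = list(zip(*rows))
    kept = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced

# Toy data: the middle column is constant and carries no information.
rows = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.1], [3.0, 5.0, 0.4]]
kept, reduced = select_by_variance(rows)
```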

Data Visualization

1. PCA
2. Plot Data
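
As a rough sketch of the PCA step, the first principal component can be found by power iteration on the covariance matrix, and the data projected onto it to get coordinates suitable for plotting. The helper names `pca_first_component` and `project` are illustrative assumptions; in practice a library implementation would usually be used, and a second component would be extracted for a 2-D plot.

```python
import math

def pca_first_component(rows, iterations=200):
    """Return (mean, unit direction) of the first principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue. Assumes the data is not constant and the
    # start vector is not orthogonal to that eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, direction):
    """Coordinate of each row along the given direction, after centering."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]

# Toy data lying roughly along the diagonal.
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
mean, direction = pca_first_component(points)
scores = project(points, mean, direction)
```

The values in `scores` are the 1-D coordinates that a plotting tool could then display.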

Distance Computation

1. One-hot encoding (turns categorical features into numeric vectors so that distances can be computed)
2. Metric learning
3. Cosine distance
4. Euclidean distance
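
The items above can be sketched in a few lines: a categorical value is one-hot encoded, appended to the numeric features, and the resulting vectors are compared with Euclidean or cosine distance. The function names and the toy `colors` vocabulary are assumptions made for illustration. Metric learning is not shown.

```python
import math

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical records: two numeric features plus one categorical feature.
colors = ["red", "green", "blue"]
a = [1.0, 2.0] + one_hot("red", colors)
b = [2.0, 4.0] + one_hot("blue", colors)

d_euclid = euclidean_distance(a, b)
d_cosine = cosine_distance(a, b)
```

Euclidean distance is sensitive to feature scale, while cosine distance depends only on the angle between vectors, so the choice between them depends on whether magnitude is meaningful for the task.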

Model Selection

1. Centroid-based clustering (K-Means, K-medoids)
2. Connectivity-based clustering (hierarchical clustering)
3. Distribution-based clustering (Gaussian mixture models, fitted with the expectation-maximization algorithm)
4. Density-based clustering (DBSCAN, OPTICS)
5. Overlapping clustering (Fuzzy C-means)
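
To make the centroid-based family concrete, here is a minimal sketch of K-Means (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignment stops changing. The function name `k_means`, the random initialization from the data, and the defaults are assumptions. Library implementations typically add better initialization and multiple restarts.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids) after running Lloyd's algorithm."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct data points (a simple assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated toy groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
```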

Model Evaluation

1. Internal evaluation: the clustering result is evaluated using only the data that was clustered.
	a. Davies–Bouldin index
	b. Dunn index
2. External evaluation: the clustering result is evaluated against data that was not used for clustering, such as known class labels or external benchmarks.
	a. Purity
	b. Rand measure
	c. F-measure
	d. Jaccard index
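
The sketch below illustrates one internal measure (the Davies–Bouldin index) and several external measures (purity, plus the pair-counting Rand, Jaccard and F-measures). All function names and the toy data are assumptions for illustration. The Davies–Bouldin code assumes at least two clusters with distinct centroids, and the pair-counting measures assume at least one pair of points shares a cluster.

```python
import math
from collections import Counter
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Internal measure; lower means tighter, better separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    centers = {l: [sum(col) / len(m) for col in zip(*m)]
               for l, m in clusters.items()}
    # Scatter: mean distance from a cluster's points to its centroid.
    scatter = {l: sum(euclidean(p, centers[l]) for p in m) / len(m)
               for l, m in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                 for j in clusters if j != i) for i in clusters]
    return sum(worst) / len(worst)

def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = {}
    for t, c in zip(true_labels, cluster_labels):
        clusters.setdefault(c, []).append(t)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(true_labels)

def pair_counts(true_labels, cluster_labels):
    """Count point pairs as (tp, fp, fn, tn) by class and cluster agreement."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(true_labels, cluster_labels):
    tp, fp, fn, tn = pair_counts(true_labels, cluster_labels)
    return (tp + tn) / (tp + fp + fn + tn)

def jaccard_index(true_labels, cluster_labels):
    tp, fp, fn, _ = pair_counts(true_labels, cluster_labels)
    return tp / (tp + fp + fn)

def f_measure(true_labels, cluster_labels, beta=1.0):
    tp, fp, fn, _ = pair_counts(true_labels, cluster_labels)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Internal evaluation on toy points: a good and a poor labeling.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
db_good = davies_bouldin(pts, [0, 0, 1, 1])
db_poor = davies_bouldin(pts, [0, 1, 0, 1])

# External evaluation against hypothetical known classes.
truth = ["a", "a", "a", "b", "b", "b"]
found = [0, 0, 1, 1, 1, 1]
scores = (purity(truth, found), rand_index(truth, found),
          jaccard_index(truth, found), f_measure(truth, found))
```

For the Davies–Bouldin index a lower value is better, whereas the external measures above all grow toward 1 as the clustering agrees more closely with the reference labels.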

Model Optimization

1. Tune model
2. Modify model