Business Task
Data Pre-processing
1. Feature Selection
Data Visualization
1. PCA
2. Plot Data
Distance Computation
1. One-hot encoding
2. Metric Learning
3. Cosine Distance
4. Euclidean Distance
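
The encoding and distance steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the helper names (`one_hot`, `euclidean_distance`, `cosine_distance`) and the `colors` category list are hypothetical, and cosine distance is taken here as one minus cosine similarity. Metric learning is left out because it depends on labelled or paired data that the outline does not specify.

```python
import math


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical categorical feature with three possible values.
colors = ["red", "green", "blue"]
red = one_hot("red", colors)
blue = one_hot("blue", colors)

d_euclid = euclidean_distance(red, blue)
d_cosine = cosine_distance(red, blue)
```

Any two distinct one-hot vectors are equally far apart under both measures, which is why one-hot encoding suits categories with no natural ordering.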
Model Selection
1. Centroid-based clustering (K-means, K-medoids)
2. Connectivity-based clustering (hierarchical clustering)
3. Distribution-based clustering (Gaussian mixture models fitted with the expectation-maximization algorithm)
4. Density-based clustering (DBSCAN, OPTICS)
5. Overlapping clustering (fuzzy C-means)
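
As an illustration of the centroid-based family, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It assumes Euclidean distance and caller-supplied starting centroids; the function name `kmeans`, its parameters, and the sample points are made up for the example. A real project would more likely use a library implementation with better initialization.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two well-separated groups of 2-D points (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several initializations and the best run is kept.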
Model Evaluation
1. Internal evaluation: a clustering result is evaluated based on the data that was clustered itself.
a. Davies–Bouldin index
b. Dunn index
2. External evaluation: clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks.
a. Purity
b. Rand measure
c. F-measure
d. Jaccard index
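
Two of the external measures listed above, purity and the Rand measure, can be sketched as follows. This is an illustrative plain-Python version that assumes ground-truth class labels are available; the function names and the sample labels are invented for the example.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = {}
    for t, c in zip(true_labels, cluster_labels):
        clusters.setdefault(c, []).append(t)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_labels):
    """Fraction of point pairs on which the two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Illustrative labels: one point of class "a" lands in the wrong cluster.
truth = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]

p = purity(truth, predicted)
r = rand_index(truth, predicted)
```

Purity rewards many small clusters (it reaches 1 when every point is its own cluster), so it is usually read alongside a pair-counting measure such as the Rand measure.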
Model Optimization
1. Tune the model's hyperparameters
2. Modify the model