K-means++ Clustering

K-means++ is an improved version of k-means that uses a smarter initialization strategy. It selects initial centroids with probability proportional to their squared distance from existing centroids, leading to better clustering quality and faster convergence.

Overview

K-means++ uses the same iterative process as standard k-means but with enhanced initialization:

  1. Smart Initialization: Select centroids using distance-proportional probability
  2. Assignment: Assign points to nearest centroids
  3. Update: Recalculate centroids as cluster means
  4. Repeat: Continue until convergence

Usage

using UnsupervisedClustering, Random

# Generate sample data
Random.seed!(42);
data = rand(100, 2);
k = 3;

# Create and run K-means++
kmeanspp = KmeansPlusPlus();
result = fit(kmeanspp, data, k);

result.objective

# output
6.668404208820978

API Reference

UnsupervisedClustering.KmeansPlusPlusType
KmeansPlusPlus{RNG}(
    metric::SemiMetric = SqEuclidean()
    verbose::Bool = DEFAULT_VERBOSE
    rng::RNG = Random.GLOBAL_RNG
    tolerance::Float64 = DEFAULT_TOLERANCE
    max_iterations::Int = DEFAULT_MAX_ITERATIONS
) where {RNG <: AbstractRNG}

K-means++ is an improvement over the standard K-means algorithm that provides better initialization by selecting initial centroids with probability proportional to their squared distance from existing centroids. This typically leads to better clustering results and faster convergence.

Type Parameters

  • RNG: the specific type of the random number generator

Fields

  • metric: defines the distance metric used to compute the distances between data points and cluster centroids.
  • verbose: controls whether the algorithm should display additional information during execution.
  • rng: represents the random number generator to be used by the algorithm.
  • tolerance: represents the convergence criterion for the algorithm. It determines the maximum change allowed in the centroid positions between consecutive iterations.
  • max_iterations: represents the maximum number of iterations the algorithm will perform before stopping, even if convergence has not been reached.

References

  • Arthur, David, and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. 2007.
source