# K-means Clustering
K-means is one of the most popular clustering algorithms. It partitions data into k clusters by minimizing the within-cluster sum of squared distances to cluster centroids.
## Overview
The k-means algorithm iteratively:
- Assigns each point to the nearest cluster centroid
- Updates centroids to the mean of assigned points
- Repeats until convergence or maximum iterations
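The loop above can be sketched in a few lines. This is a plain NumPy illustration of Lloyd's algorithm, not the package's implementation; the function and parameter names here are illustrative only:

```python
import numpy as np

def kmeans(data, k, max_iterations=100, tolerance=1e-6, seed=None):
    """Minimal sketch of Lloyd's algorithm: assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keep the old position if a cluster ends up empty).
        new_centroids = np.array([
            data[assignments == j].mean(axis=0) if (assignments == j).any()
            else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids move less than the tolerance.
        if np.linalg.norm(new_centroids - centroids) < tolerance:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Objective: within-cluster sum of squared distances.
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)
    objective = dists.min(axis=1).sum()
    return assignments, centroids, objective
```

The `tolerance` and `max_iterations` parameters play the same role as the fields of the same name on the algorithm types documented below.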
## Usage

```julia
using UnsupervisedClustering, Random

# Generate sample data
Random.seed!(42);
data = rand(100, 2);
k = 3;

# Create and run k-means
kmeans = Kmeans();
result = fit(kmeans, data, k);
result.objective

# output

6.911504212197341
```

## API Reference
### UnsupervisedClustering.BalancedKmeans — Type

```julia
BalancedKmeans(
    metric::SemiMetric = SqEuclidean(),
    verbose::Bool = DEFAULT_VERBOSE,
    rng::AbstractRNG = Random.GLOBAL_RNG,
    tolerance::Float64 = DEFAULT_TOLERANCE,
    max_iterations::Int = DEFAULT_MAX_ITERATIONS
)
```

Balanced k-means is a variation of the k-means clustering algorithm that balances the number of data points assigned to each cluster.

#### Fields

- `metric`: the distance metric used to compute the distances between data points and cluster centroids.
- `verbose`: controls whether the algorithm displays additional information during execution.
- `rng`: the random number generator used by the algorithm.
- `tolerance`: the convergence criterion: the maximum change allowed in the centroid positions between consecutive iterations.
- `max_iterations`: the maximum number of iterations the algorithm performs before stopping, even if convergence has not been reached.
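One simple way to balance cluster sizes is to cap each cluster at `ceil(n / k)` points and hand out assignments in order of increasing point-to-centroid distance. The NumPy sketch below illustrates that idea only; it is not necessarily the scheme this package implements, and the function name is ours:

```python
import numpy as np

def balanced_assign(data, centroids):
    """Assign each point to a centroid while capping every cluster at
    ceil(n / k) points, visiting (point, cluster) pairs from the
    smallest distance to the largest (one common balancing heuristic;
    the package may use a different scheme)."""
    n, k = len(data), len(centroids)
    capacity = -(-n // k)  # ceil(n / k)
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = np.full(n, -1)
    counts = np.zeros(k, dtype=int)
    for flat in np.argsort(dists, axis=None):
        i, j = divmod(int(flat), k)  # flat index of the (n, k) matrix
        if assignments[i] == -1 and counts[j] < capacity:
            assignments[i] = j
            counts[j] += 1
    return assignments
```

Because the total capacity `k * ceil(n / k)` is at least `n`, every point receives an assignment, and no cluster exceeds the cap.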
### UnsupervisedClustering.Kmeans — Type

```julia
Kmeans(
    metric::SemiMetric = SqEuclidean(),
    verbose::Bool = DEFAULT_VERBOSE,
    rng::AbstractRNG = Random.GLOBAL_RNG,
    tolerance::Float64 = DEFAULT_TOLERANCE,
    max_iterations::Int = DEFAULT_MAX_ITERATIONS
)
```

K-means is a clustering algorithm that partitions data into clusters by minimizing the distances between data points and their cluster centroids.

#### Fields

- `metric`: the distance metric used to compute the distances between data points and cluster centroids.
- `verbose`: controls whether the algorithm displays additional information during execution.
- `rng`: the random number generator used by the algorithm.
- `tolerance`: the convergence criterion: the maximum change allowed in the centroid positions between consecutive iterations.
- `max_iterations`: the maximum number of iterations the algorithm performs before stopping, even if convergence has not been reached.
#### References

- Hartigan, John A., and Manchek A. Wong. "Algorithm AS 136: A k-means clustering algorithm." Journal of the Royal Statistical Society, Series C (Applied Statistics) 28.1 (1979): 100-108.
- Lloyd, Stuart. "Least squares quantization in PCM." IEEE Transactions on Information Theory 28.2 (1982): 129-137.
### UnsupervisedClustering.KmeansResult — Type

```julia
KmeansResult(
    assignments::AbstractVector{<:Integer},
    clusters::AbstractMatrix{<:Real},
    objective::Real,
    objective_per_cluster::AbstractVector{<:Real},
    iterations::Integer,
    elapsed::Real,
    converged::Bool,
    k::Integer
)
```

The `KmeansResult` struct represents the result of the k-means clustering algorithm.

#### Fields

- `assignments`: an integer vector storing the cluster assignment of each data point.
- `clusters`: a floating-point matrix representing the cluster centroids.
- `objective`: a floating-point number giving the value of the objective function after running the algorithm; it measures the quality of the clustering solution.
- `objective_per_cluster`: a floating-point vector storing the objective function value of each cluster.
- `iterations`: the number of iterations performed until the algorithm converged or reached the maximum number of iterations.
- `elapsed`: the time, in seconds, the algorithm took to complete.
- `converged`: indicates whether the algorithm converged to a solution.
- `k`: the number of clusters.
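The relationship between these fields can be illustrated outside the package: the total `objective` is the sum of the per-cluster objectives, each of which is the within-cluster sum of squared distances to that cluster's centroid (with the default squared-Euclidean metric). A plain NumPy sketch, with an illustrative function name of our own:

```python
import numpy as np

def objectives(data, assignments, centroids):
    """Within-cluster sum of squared distances, per cluster and in total."""
    per_cluster = np.array([
        ((data[assignments == j] - centroids[j]) ** 2).sum()
        for j in range(len(centroids))
    ])
    # The total objective is the sum over clusters.
    return per_cluster, per_cluster.sum()
```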
### UnsupervisedClustering.fit! — Method

```julia
fit!(
    kmeans::AbstractKmeans,
    data::AbstractMatrix{<:Real},
    result::KmeansResult
)
```

The `fit!` function runs the k-means clustering algorithm using the given `result` as the starting point and updates it in place with the clustering result.

#### Parameters

- `kmeans`: an instance representing the clustering settings and parameters.
- `data`: a floating-point matrix in which each row is a data point and each column is a feature.
- `result`: a result object that will be updated with the clustering result.
#### Example

```julia
n = 100
d = 2
k = 2
data = rand(n, d)
kmeans = Kmeans()
result = KmeansResult(n, [1.0 2.0; 1.0 2.0])
fit!(kmeans, data, result)
```

### UnsupervisedClustering.fit — Method

```julia
fit(
    kmeans::AbstractKmeans,
    data::AbstractMatrix{<:Real},
    initial_clusters::AbstractVector{<:Integer}
)
```

The `fit` function runs the k-means clustering algorithm using the given data points as the initial centroids and returns a result object representing the clustering result.
#### Parameters

- `kmeans`: an instance representing the clustering settings and parameters.
- `data`: a floating-point matrix in which each row is a data point and each column is a feature.
- `initial_clusters`: an integer vector whose elements are the indices of the data points used to initialize each cluster.
#### Example

```julia
n = 100
d = 2
k = 2
data = rand(n, d)
kmeans = Kmeans()
result = fit(kmeans, data, [4, 12])
```

### UnsupervisedClustering.fit — Method

```julia
fit(
    kmeans::AbstractKmeans,
    data::AbstractMatrix{<:Real},
    k::Integer
)
```

The `fit` function runs the k-means clustering algorithm and returns a result object representing the clustering result.
#### Parameters

- `kmeans`: an instance representing the clustering settings and parameters.
- `data`: a floating-point matrix in which each row is a data point and each column is a feature.
- `k`: an integer representing the number of clusters.
#### Example

```julia
n = 100
d = 2
k = 2
data = rand(n, d)
kmeans = Kmeans()
result = fit(kmeans, data, k)
```