Gaussian Mixture Models (GMM)

Gaussian Mixture Models represent data as a mixture of Gaussian distributions, providing probabilistic cluster assignments and soft clustering capabilities. Unlike k-means, GMM can model elliptical clusters and provides uncertainty estimates.

Overview

GMM uses the Expectation-Maximization (EM) algorithm:

E-step: Compute, for each point, the posterior probability that it belongs to each Gaussian component, given the current parameters
M-step: Update the Gaussian parameters (means, covariances, mixing weights), using those probabilities as weights
Repeat: Alternate the two steps until the log-likelihood converges
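
The steps above can be written out in their standard textbook form. The notation is introduced here for illustration and does not come from the package: $x_i$ are the $n$ data points, and $\pi_j$, $\mu_j$, $\Sigma_j$ are the mixing weight, mean and covariance of component $j$ out of $k$. The covariance update shown is the plain maximum-likelihood one; the package delegates this step to the configured estimator, so its actual update may differ.

```latex
\begin{align*}
\text{E-step:}\quad
  \gamma_{ij} &= \frac{\pi_j \,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
                      {\sum_{l=1}^{k} \pi_l \,\mathcal{N}(x_i \mid \mu_l, \Sigma_l)} \\[4pt]
\text{M-step:}\quad
  n_j &= \sum_{i=1}^{n} \gamma_{ij}, \qquad
  \pi_j = \frac{n_j}{n}, \qquad
  \mu_j = \frac{1}{n_j} \sum_{i=1}^{n} \gamma_{ij}\, x_i \\
  \Sigma_j &= \frac{1}{n_j} \sum_{i=1}^{n} \gamma_{ij}\,
              (x_i - \mu_j)(x_i - \mu_j)^{\top} \\[4pt]
\text{Log-likelihood:}\quad
  \ell &= \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)
\end{align*}
```

Each EM iteration cannot decrease $\ell$, which is why the change in $\ell$ between iterations is a natural stopping criterion.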

Usage

using UnsupervisedClustering, Random

# Generate sample data
Random.seed!(42);
data = rand(100, 2);
k = 3;

# Create covariance estimator (required for GMM)
n, d = size(data);
estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d);

# Create and run GMM
gmm = GMM(estimator = estimator);
result = fit(gmm, data, k);

result.objective

# output
-0.43900384415707366
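
Beyond the objective, the returned result exposes the fields listed under GMMResult below. The following is a minimal sketch of how they might be inspected. It reuses the setup above and assumes that cluster assignments are integers in 1:k, which this page does not state explicitly.

```julia
using UnsupervisedClustering, Random

Random.seed!(42)
data = rand(100, 2)
n, d = size(data)

gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
result = fit(gmm, data, 3)

# Field names are the ones documented under GMMResult.
println("converged: ", result.converged, " after ", result.iterations, " iterations")
println("mixing weights: ", result.weights)
println("first five assignments: ", result.assignments[1:5])

# Assumption: assignments are cluster indices in 1:result.k.
counts = [count(==(j), result.assignments) for j in 1:result.k]
println("points per cluster: ", counts)
```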

API Reference

UnsupervisedClustering.GMM — Type

GMM(
    estimator::CovarianceMatrixEstimator
    verbose::Bool = DEFAULT_VERBOSE
    rng::AbstractRNG = Random.GLOBAL_RNG
    tolerance::Float64 = DEFAULT_TOLERANCE
    max_iterations::Int = DEFAULT_MAX_ITERATIONS
    decompose_if_fails::Bool = true
)

The GMM is a clustering algorithm that models the underlying data distribution as a mixture of Gaussian distributions.

Fields

estimator: represents the method or algorithm used to estimate the covariance matrices in the GMM.
verbose: controls whether the algorithm should display additional information during execution.
rng: represents the random number generator to be used by the algorithm.
tolerance: the convergence threshold for the algorithm. The algorithm is considered to have converged once the change in the model's log-likelihood between consecutive iterations falls below this value.
max_iterations: represents the maximum number of iterations the algorithm will perform before stopping, even if convergence has not been reached.
decompose_if_fails: determines whether, when the decomposition of a component's covariance matrix fails due to numerical issues, the algorithm should attempt to decompose the matrix again and fix its eigenvalues.
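
To illustrate what fixing the eigenvalues of a covariance matrix can mean, here is a minimal sketch using only Julia's LinearAlgebra standard library. The helper `repair_covariance` and its `min_eigenvalue` keyword are hypothetical names invented for this example. This is one common repair strategy, not necessarily what the package does when decompose_if_fails is true.

```julia
using LinearAlgebra

# Hypothetical helper, not part of UnsupervisedClustering.
function repair_covariance(Σ::Symmetric{<:Real}; min_eigenvalue::Real = 1e-6)
    # Nothing to do if the matrix is already positive definite.
    isposdef(Σ) && return Σ
    # Eigen-decompose and raise every eigenvalue below the threshold to it.
    F = eigen(Σ)
    λ = max.(F.values, min_eigenvalue)
    # Rebuild the matrix from the original eigenvectors and the fixed eigenvalues.
    return Symmetric(F.vectors * Diagonal(λ) * F.vectors')
end

Σ = Symmetric([1.0 2.0; 2.0 1.0])   # eigenvalues -1 and 3, so not positive definite
Σ_fixed = repair_covariance(Σ)

println(isposdef(Σ))        # false
println(isposdef(Σ_fixed))  # true
```

Clamping changes the matrix as little as possible in the directions that were already well conditioned, at the cost of inflating the variance along the repaired directions.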

References

Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society: series B (methodological) 39.1 (1977): 1-22.


UnsupervisedClustering.GMMResult — Type

GMMResult(
    assignments::AbstractVector{<:Integer}
    weights::AbstractVector{<:Real}
    clusters::AbstractVector{<:AbstractVector{<:Real}}
    covariances::AbstractVector{<:Symmetric{<:Real}}
    objective::Real
    iterations::Integer
    elapsed::Real
    converged::Bool
    k::Integer
)

The GMMResult struct represents the result of the GMM clustering algorithm.

Fields

assignments: an integer vector that stores the cluster assignment for each data point.
weights: a vector of floating-point numbers holding the mixing weight of each cluster. Each weight is the prior probability that a data point is generated by the corresponding cluster.
clusters: a vector of floating-point vectors, where each inner vector is the centroid (mean) of a cluster.
covariances: a vector of symmetric matrices, where each matrix represents the covariance matrix of a cluster in the GMM model. The covariance matrix describes the shape and orientation of the data distribution within each cluster.
objective: a floating-point number representing the value of the objective function after running the algorithm. The objective function measures the quality of the clustering solution.
iterations: an integer value indicating the number of iterations performed until the algorithm converged or reached the maximum number of iterations.
elapsed: a floating-point number representing the time in seconds for the algorithm to complete.
converged: indicates whether the algorithm has converged to a solution.
k: the number of clusters.
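
The weights, clusters and covariances fields together define a mixture density, from which soft assignments can be derived. The sketch below shows the idea using only the LinearAlgebra standard library. The functions `gaussian_logpdf` and `responsibilities` are hypothetical helpers written for this example, and the parameter values are made up rather than taken from a fitted result.

```julia
using LinearAlgebra

# Log-density of a multivariate normal N(μ, Σ) evaluated at x.
function gaussian_logpdf(x, μ, Σ)
    d = length(x)
    C = cholesky(Symmetric(Σ))
    z = C.L \ (x - μ)               # whitened residual
    return -0.5 * (d * log(2π) + logdet(C) + dot(z, z))
end

# Posterior probability of each component for a single point x.
function responsibilities(x, weights, clusters, covariances)
    logp = [log(weights[j]) + gaussian_logpdf(x, clusters[j], covariances[j])
            for j in eachindex(weights)]
    logp .-= maximum(logp)          # subtract the maximum for numerical stability
    p = exp.(logp)
    return p ./ sum(p)
end

# Made-up parameters shaped like the GMMResult fields.
weights = [0.5, 0.5]
clusters = [[0.0, 0.0], [4.0, 4.0]]
covariances = [Symmetric([1.0 0.0; 0.0 1.0]), Symmetric([1.0 0.0; 0.0 1.0])]

r = responsibilities([0.0, 0.0], weights, clusters, covariances)
hard_assignment = argmax(r)         # index of the most probable component
println(r, " -> cluster ", hard_assignment)
```

Taking the argmax of the responsibilities gives a hard assignment comparable to the assignments field, while the full vector keeps the uncertainty.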


UnsupervisedClustering.fit! — Method

fit!(
    gmm::GMM,
    data::AbstractMatrix{<:Real},
    result::GMMResult
)

The fit! function runs the GMM clustering algorithm on the data, using the given result as the starting point, and updates that result object in place with the clustering result.

Parameters:

gmm: an instance representing the clustering settings and parameters.
data: a floating-point matrix, where each row represents a data point, and each column represents a feature.
result: a result object that will be updated with the clustering result.

Example

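using UnsupervisedClustering

n = 100  # number of data points
d = 2    # number of features
k = 2    # number of clusters

data = rand(n, d)

gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
# Starting point: one initial centroid per cluster
result = GMMResult(n, [[1.0, 1.0], [2.0, 2.0]])
# Runs the algorithm from this starting point and updates `result` in place
fit!(gmm, data, result)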


UnsupervisedClustering.fit — Method

fit(
    gmm::GMM,
    data::AbstractMatrix{<:Real},
    initial_clusters::AbstractVector{<:Integer}
)

The fit function runs the GMM clustering algorithm, using the given data points as the initial cluster centers, and returns a result object representing the clustering result.

Parameters:

gmm: an instance representing the clustering settings and parameters.
data: a floating-point matrix, where each row represents a data point, and each column represents a feature.
initial_clusters: an integer vector where each element is the index of the data point (row of data) used as the initial center of one cluster.

Example

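using UnsupervisedClustering

n = 100  # number of data points
d = 2    # number of features
k = 2    # number of clusters

data = rand(n, d)

gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
# Data points 4 and 12 serve as the initial centers of the k = 2 clusters
result = fit(gmm, data, [4, 12])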


UnsupervisedClustering.fit — Method

fit(
    gmm::GMM,
    data::AbstractMatrix{<:Real},
    k::Integer
)

The fit function performs the GMM clustering algorithm and returns a result object representing the clustering result.

Parameters:

gmm: an instance representing the clustering settings and parameters.
data: a floating-point matrix, where each row represents a data point, and each column represents a feature.
k: an integer representing the number of clusters.

Example

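using UnsupervisedClustering

n = 100  # number of data points
d = 2    # number of features
k = 2    # number of clusters

data = rand(n, d)

gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
# Only the number of clusters is given; no explicit starting point is supplied
result = fit(gmm, data, k)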
