Gaussian Mixture Models (GMM)
Gaussian Mixture Models represent data as a mixture of Gaussian distributions, providing probabilistic cluster assignments and soft clustering capabilities. Unlike k-means, GMM can model elliptical clusters and provides uncertainty estimates.
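The description above can be written compactly. The notation here is introduced only for illustration and is not taken from the package: k components with mixing weights \pi_j, means \mu_j and covariances \Sigma_j. The first line is the mixture density, the second the soft (probabilistic) assignment of point x_i to component j.

```latex
p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1

\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
                   {\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}
```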
Overview
GMM uses the Expectation-Maximization (EM) algorithm:
- E-step: Calculate probabilities of each point belonging to each Gaussian component
- M-step: Update Gaussian parameters (means, covariances, mixing weights) based on probabilities
- Repeat: Continue until convergence of log-likelihood
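The three steps above can be sketched in plain Julia using only the standard library. This is a minimal illustration, not the package's implementation: the function names, the random initialization, the small ridge added to each covariance, and the absolute log-likelihood stopping test are all assumptions made for this example.

```julia
using LinearAlgebra, Random

# Log-density of the multivariate normal N(mu, S) evaluated at x.
function log_gaussian(x, mu, S)
    C = cholesky(Symmetric(S))
    z = C.L \ (x - mu)
    return -0.5 * (length(x) * log(2pi) + logdet(C) + dot(z, z))
end

# Illustrative EM loop (hypothetical helper, not part of UnsupervisedClustering).
function em_sketch(X, k; tolerance = 1e-6, max_iterations = 200, rng = Random.default_rng())
    n, d = size(X)
    # Assumed initialization: k distinct random rows as means,
    # identity covariances, uniform mixing weights.
    means = [X[i, :] for i in randperm(rng, n)[1:k]]
    covariances = [Matrix{Float64}(I, d, d) for _ in 1:k]
    weights = fill(1 / k, k)
    R = zeros(n, k)  # R[i, j]: probability that point i belongs to component j
    loglik, iterations, converged = -Inf, 0, false

    while iterations < max_iterations && !converged
        iterations += 1
        previous = loglik
        loglik = 0.0

        # E-step: probability of each point under each component.
        for i in 1:n
            logp = [log(weights[j]) + log_gaussian(X[i, :], means[j], covariances[j]) for j in 1:k]
            m = maximum(logp)
            lse = m + log(sum(exp.(logp .- m)))  # numerically stable log-sum-exp
            R[i, :] = exp.(logp .- lse)
            loglik += lse
        end

        # M-step: re-estimate mixing weights, means and covariances.
        for j in 1:k
            r = R[:, j]
            nj = sum(r)
            weights[j] = nj / n
            means[j] = vec(sum(r .* X, dims = 1)) ./ nj
            Xc = X .- means[j]'
            # Small ridge (an assumption) keeps the covariance positive definite.
            covariances[j] = (Xc' * (r .* Xc)) ./ nj + 1e-6 * I
        end

        # Repeat: stop once the log-likelihood no longer changes noticeably.
        converged = abs(loglik - previous) < tolerance
    end

    return (weights = weights, means = means, covariances = covariances,
            responsibilities = R, loglik = loglik,
            iterations = iterations, converged = converged)
end
```

Calling em_sketch(rand(100, 2), 3) returns a named tuple whose responsibilities matrix holds the soft assignments; each of its rows sums to 1.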
Usage
using UnsupervisedClustering, Random
# Generate sample data
Random.seed!(42);
data = rand(100, 2);
k = 3;
# Create covariance estimator (required for GMM)
n, d = size(data);
estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d);
# Create and run GMM
gmm = GMM(estimator = estimator);
result = fit(gmm, data, k);
result.objective
# output
-0.43900384415707366
API Reference
UnsupervisedClustering.GMM — Type
GMM(
estimator::CovarianceMatrixEstimator
verbose::Bool = DEFAULT_VERBOSE
rng::AbstractRNG = Random.GLOBAL_RNG
tolerance::Float64 = DEFAULT_TOLERANCE
max_iterations::Int = DEFAULT_MAX_ITERATIONS
decompose_if_fails::Bool = true
)
The GMM is a clustering algorithm that models the underlying data distribution as a mixture of Gaussian distributions.
Fields
- estimator: the method used to estimate the covariance matrices in the GMM.
- verbose: controls whether the algorithm displays additional information during execution.
- rng: the random number generator used by the algorithm.
- tolerance: the convergence criterion. The algorithm is considered converged once the change in the model's log-likelihood between consecutive iterations no longer exceeds this value.
- max_iterations: the maximum number of iterations the algorithm performs before stopping, even if convergence has not been reached.
- decompose_if_fails: determines whether the algorithm should attempt to decompose the covariance matrix of a component and fix its eigenvalues if the decomposition fails due to numerical issues.
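The idea behind decompose_if_fails can be illustrated with the standard library alone. The sketch below is an assumption about what "fixing the eigenvalues" might look like (clamping them to a small positive floor); repair_covariance and its floor keyword are hypothetical names, and the package may use a different strategy.

```julia
using LinearAlgebra

# Hypothetical helper: rebuild a symmetric matrix after raising any
# eigenvalue below `floor` up to `floor`.
function repair_covariance(S::AbstractMatrix{<:Real}; floor = 1e-6)
    F = eigen(Symmetric(Matrix(S)))
    vals = max.(F.values, floor)
    return Symmetric(F.vectors * Diagonal(vals) * F.vectors')
end

# Indefinite matrix (eigenvalues -1 and 3): a Cholesky factorization would fail.
S = [1.0 2.0; 2.0 1.0]
repaired = repair_covariance(S)

println("before: ", isposdef(S))
println("after:  ", isposdef(repaired))
```

Clamping changes the matrix, so a repaired covariance no longer matches the estimate exactly; the trade-off is that the EM iteration can continue.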
References
- Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39.1 (1977): 1-22.
UnsupervisedClustering.GMMResult — Type
GMMResult(
assignments::AbstractVector{<:Integer}
weights::AbstractVector{<:Real}
clusters::AbstractVector{<:AbstractVector{<:Real}}
covariances::AbstractVector{<:Symmetric{<:Real}}
objective::Real
iterations::Integer
elapsed::Real
converged::Bool
k::Integer
)
GMMResult struct represents the result of the GMM clustering algorithm.
Fields
- assignments: an integer vector that stores the cluster assignment for each data point.
- weights: a vector of floating-point numbers holding the mixing weight of each cluster, i.e. the prior probability that a data point belongs to that cluster.
- clusters: a vector of floating-point vectors, each holding the centroid of one cluster.
- covariances: a vector of symmetric matrices, where each matrix is the covariance matrix of a cluster in the GMM model. The covariance matrix describes the shape and orientation of the data distribution within each cluster.
- objective: a floating-point number representing the value of the objective function after running the algorithm. The objective function measures the quality of the clustering solution.
- iterations: an integer value indicating the number of iterations performed until the algorithm converged or reached the maximum number of iterations.
- elapsed: a floating-point number representing the time in seconds for the algorithm to complete.
- converged: indicates whether the algorithm has converged to a solution.
- k: the number of clusters.
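To make the relation between soft probabilities and the assignments and weights fields concrete, here is a small self-contained sketch. The matrix R is made-up data, and taking the most probable cluster as the hard assignment is an assumption for illustration; the source does not state how the package derives assignments.

```julia
# Hypothetical soft assignments for 3 points and 2 clusters; each row sums to 1.
R = [0.9 0.1;
     0.2 0.8;
     0.6 0.4]

# Hard assignment: the most probable cluster for each point.
assignments = [argmax(R[i, :]) for i in axes(R, 1)]

# Mixing weights: the average probability mass each cluster receives.
weights = vec(sum(R, dims = 1)) ./ size(R, 1)

println(assignments)
println(weights)
```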
UnsupervisedClustering.fit! — Method
fit!(
gmm::GMM,
data::AbstractMatrix{<:Real},
result::GMMResult
)
The fit! function runs the GMM clustering algorithm using the given result as the starting point and updates that object in place with the clustering result.
Parameters:
- gmm: an instance representing the clustering settings and parameters.
- data: a floating-point matrix, where each row represents a data point, and each column represents a feature.
- result: a result object that will be updated with the clustering result.
Example
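using UnsupervisedClustering
n = 100  # number of data points
d = 2    # number of features
k = 2    # number of clusters (one per initial centroid below)
data = rand(n, d)
# Estimator qualified with the module name, as in the Usage section
gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
# Starting point built from two initial centroids
result = GMMResult(n, [[1.0, 1.0], [2.0, 2.0]])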
fit!(gmm, data, result)
UnsupervisedClustering.fit — Method
fit(
gmm::GMM,
data::AbstractMatrix{<:Real},
initial_clusters::AbstractVector{<:Integer}
)
The fit function runs the GMM clustering algorithm using the given data points as the initial point of each cluster and returns a result object representing the clustering result.
Parameters:
- gmm: an instance representing the clustering settings and parameters.
- data: a floating-point matrix, where each row represents a data point, and each column represents a feature.
- initial_clusters: an integer vector where each element is the index of the data point used as the initial point of one cluster.
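One simple way to build such an index vector is to draw k distinct rows at random. This is only a sketch of one possible choice; the seed and sizes are arbitrary, and the package does not require this particular strategy.

```julia
using Random

rng = MersenneTwister(7)  # arbitrary seed, for reproducibility
n, k = 100, 3

# k distinct row indices, each naming the data point that seeds one cluster.
initial_clusters = randperm(rng, n)[1:k]

# The vector would then be passed as the third argument:
# result = fit(gmm, data, initial_clusters)
```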
Example
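using UnsupervisedClustering
n = 100  # number of data points
d = 2    # number of features
k = 2    # number of clusters (one per initial index below)
data = rand(n, d)
# Estimator qualified with the module name, as in the Usage section
gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))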
result = fit(gmm, data, [4, 12])
UnsupervisedClustering.fit — Method
fit(
gmm::GMM,
data::AbstractMatrix{<:Real},
k::Integer
)
The fit function performs the GMM clustering algorithm and returns a result object representing the clustering result.
Parameters:
- gmm: an instance representing the clustering settings and parameters.
- data: a floating-point matrix, where each row represents a data point, and each column represents a feature.
- k: an integer representing the number of clusters.
Example
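using UnsupervisedClustering
n = 100  # number of data points
d = 2    # number of features
k = 2    # number of clusters
data = rand(n, d)
# Estimator qualified with the module name, as in the Usage section
gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
result = fit(gmm, data, k)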
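After fitting, the fields documented for GMMResult can be read directly. The sketch below assumes the package is installed and uses only names that appear on this page; the seed and the data are arbitrary, and the per-cluster count is computed here purely for illustration.

```julia
using UnsupervisedClustering, Random

Random.seed!(42)
data = rand(100, 2)
n, d = size(data)
k = 3

gmm = GMM(estimator = UnsupervisedClustering.EmpiricalCovarianceMatrix(n, d))
result = fit(gmm, data, k)

# Number of points hard-assigned to each cluster.
counts = [count(==(j), result.assignments) for j in 1:result.k]

println("converged:  ", result.converged)
println("iterations: ", result.iterations)
println("weights:    ", result.weights)
println("counts:     ", counts)
```

Checking result.converged before using the model is a cheap safeguard: a run that stopped at max_iterations still returns a result, but its parameters may not have settled.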