Package edu.cmu.tetrad.search.unmix
Class EmUnmix
java.lang.Object
edu.cmu.tetrad.search.unmix.EmUnmix
The EmUnmix class applies the Expectation-Maximization (EM) algorithm to residual
signatures derived from a dataset in order to fit Gaussian mixtures. It supports
fitting a single model as well as selecting the cluster count (K) using criteria
such as the Bayesian Information Criterion (BIC). Optionally, graph searches can be
run on the pooled data and on each cluster.
Residual regression is used to build the input data before the Gaussian mixtures
are fitted. The class also supports various configurations for the EM algorithm
and for data preprocessing.
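To make the residual-regression step more concrete, here is a minimal, self-contained sketch of computing residuals from an ordinary least-squares fit on a single predictor. It is only an illustration of the general idea, not Tetrad's implementation: the class name `ResidualSketch`, the method `residuals`, and the single-predictor setup are all assumptions made for brevity, and the source does not say how ResidualRegressor computes its residuals.

```java
/** Hypothetical sketch (not part of Tetrad): residuals of y after a least-squares fit on x. */
final class ResidualSketch {

    /** Returns y[i] - (intercept + slope * x[i]) for the ordinary least-squares line. */
    static double[] residuals(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("x and y need equal length >= 2");
        }
        int n = x.length;

        // Sample means of predictor and response.
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }

        // Centered cross-product and sum of squares.
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x must not be constant");
        }

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;

        // What the linear fit leaves unexplained.
        double[] r = new double[n];
        for (int i = 0; i < n; i++) {
            r[i] = y[i] - (intercept + slope * x[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2.1, 3.9, 6.2, 7.8};
        System.out.println(java.util.Arrays.toString(residuals(x, y)));
    }
}
```

In a mixture setting, rows whose residuals behave differently from the rest are the ones a subsequent clustering step can pick up on.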
Nested Class Summary
Nested Classes
Modifier and Type: static final class
Class: EmUnmix.Config
Description: Configuration class for the EmUnmix algorithm, providing parameters and settings to control the behavior of the unmixing process.
Constructor Summary
Constructors
Constructor: EmUnmix()
Description: Default constructor for EmUnmix.
Method Summary
Modifier and Type: static UnmixResult
Method: run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor)
Description: Runs the unmixing process on the provided dataset using the specified configuration and regressor.

Modifier and Type: static UnmixResult
Method: run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor, Function<DataSet, Graph> pooledSearch, Function<DataSet, Graph> perClusterSearch)
Description: Executes the unmixing process on the given dataset using the specified configuration, residual regressor, and optional graph search functions.

Modifier and Type: static UnmixResult
Method: selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, EmUnmix.Config base)
Description: Selects the optimal number of clusters (K) for unmixing the provided dataset within the specified range.

Modifier and Type: static UnmixResult
Method: selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, Function<DataSet, Graph> pooledSearch, Function<DataSet, Graph> perClusterSearch, EmUnmix.Config base)
Description: Selects the optimal number of clusters (K) for unmixing a dataset within the specified range [Kmin, Kmax].
Constructor Details
EmUnmix
public EmUnmix()
Default constructor for EmUnmix.
Method Details
run
public static UnmixResult run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor)
Runs the unmixing process on the provided dataset using the specified configuration and regressor.
Parameters:
data - The dataset on which the unmixing process is performed. Contains data points to be clustered.
cfg - The configuration object providing parameters and settings for the unmixing algorithm.
regressor - The residual regressor used for determining cluster assignments and handling dependencies in data.
Returns:
An instance of UnmixResult containing the results of the unmixing process, including cluster labels, per-cluster datasets, and optional cluster graphs.
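The result is described as carrying cluster labels. One common way to obtain hard labels from a fitted Gaussian mixture is to compute each component's posterior responsibility for a point (the EM E-step) and take the most probable component. The sketch below does this for one-dimensional data. It illustrates the general technique only; `EStepSketch`, `responsibilities`, and `hardLabel` are hypothetical names, and the source does not state how EmUnmix derives its labels.

```java
/** Hypothetical sketch (not part of Tetrad): E-step responsibilities for a 1-D Gaussian mixture. */
final class EStepSketch {

    /** Density of a normal distribution with the given mean and variance at x. */
    static double normalPdf(double x, double mean, double var) {
        double d = x - mean;
        return Math.exp(-0.5 * d * d / var) / Math.sqrt(2.0 * Math.PI * var);
    }

    /**
     * Posterior probability of each component for a single point x.
     * Note: works in plain probability space for readability; a production
     * version would work in log space to avoid underflow for distant points.
     */
    static double[] responsibilities(double x, double[] weights, double[] means, double[] vars) {
        int k = weights.length;
        double[] r = new double[k];
        double total = 0.0;
        for (int j = 0; j < k; j++) {
            r[j] = weights[j] * normalPdf(x, means[j], vars[j]);
            total += r[j];
        }
        for (int j = 0; j < k; j++) {
            r[j] /= total; // normalize so the entries sum to 1
        }
        return r;
    }

    /** Index of the component with the largest responsibility (ties go to the lowest index). */
    static int hardLabel(double[] resp) {
        int best = 0;
        for (int j = 1; j < resp.length; j++) {
            if (resp[j] > resp[best]) best = j;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.5};
        double[] means = {0.0, 10.0};
        double[] vars = {1.0, 1.0};
        for (double x : new double[]{0.5, 9.0}) {
            int label = hardLabel(responsibilities(x, weights, means, vars));
            System.out.println("x=" + x + " -> cluster " + label);
        }
    }
}
```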
selectK
public static UnmixResult selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, EmUnmix.Config base)
Selects the optimal number of clusters (K) for unmixing the provided dataset within the specified range. The method uses the specified residual regressor and configuration settings to determine the best cluster count.
Parameters:
data - The dataset to analyze for optimal clustering. Contains data points to be partitioned.
Kmin - The minimum number of clusters to evaluate.
Kmax - The maximum number of clusters to evaluate.
regressor - The residual regressor used to fit the data and evaluate clustering performance.
base - The base configuration used for running the clustering algorithm during evaluation.
Returns:
An UnmixResult object containing the best clustering results, cluster labels, per-cluster datasets, and additional optional cluster information.
run
public static UnmixResult run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor, Function<DataSet, Graph> pooledSearch, Function<DataSet, Graph> perClusterSearch)
Executes the unmixing process on the given dataset using the specified configuration, residual regressor, and optional graph search functions. The method applies residual signature extraction, EM clustering, and optionally searches for cluster-specific graphical representations.
Parameters:
data - The dataset to be processed, containing data points to be partitioned into clusters.
cfg - The configuration object that provides parameters and settings for the unmixing algorithm.
regressor - The residual regressor used for generating residual signatures and handling dependencies in the data.
pooledSearch - A function that builds a pooled graphical representation of the dataset. May be null if not needed.
perClusterSearch - A function that builds graphical representations for individual clusters. May be null if not applicable.
Returns:
An instance of UnmixResult containing the clustering results, including cluster labels, per-cluster datasets, and optionally cluster-specific graphs.
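Because both search functions may be null, a caller-side mental model of the optional step can help. The following sketch shows one way such a null-tolerant, per-cluster application of a Function could look. It uses generic stand-ins rather than Tetrad types: `OptionalSearchSketch`, `splitByLabel`, and `searchEach` are hypothetical names, rows are plain list elements instead of DataSet rows, and the returned "graph" is whatever the supplied function produces. How EmUnmix handles this internally is not stated in the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch (not part of Tetrad): optional per-cluster search with a nullable Function. */
final class OptionalSearchSketch {

    /** Splits rows into k groups, where labels[i] in [0, k) is the cluster of rows.get(i). */
    static <R> List<List<R>> splitByLabel(List<R> rows, int[] labels, int k) {
        List<List<R>> clusters = new ArrayList<>();
        for (int j = 0; j < k; j++) {
            clusters.add(new ArrayList<>());
        }
        for (int i = 0; i < rows.size(); i++) {
            clusters.get(labels[i]).add(rows.get(i));
        }
        return clusters;
    }

    /** Applies search to every cluster; returns an empty list when search is null. */
    static <R, G> List<G> searchEach(List<List<R>> clusters, Function<List<R>, G> search) {
        List<G> results = new ArrayList<>();
        if (search == null) {
            return results; // the optional step is simply skipped
        }
        for (List<R> cluster : clusters) {
            results.add(search.apply(cluster));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d");
        int[] labels = {0, 1, 0, 1};
        List<List<String>> clusters = splitByLabel(rows, labels, 2);

        // A stand-in "search" that just describes its input.
        List<String> described = searchEach(clusters, c -> "result over " + c.size() + " rows");
        System.out.println(described);

        // Passing null skips the search entirely.
        List<String> skipped = searchEach(clusters, null);
        System.out.println(skipped.isEmpty());
    }
}
```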
selectK
public static UnmixResult selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, Function<DataSet, Graph> pooledSearch, Function<DataSet, Graph> perClusterSearch, EmUnmix.Config base)
Selects the optimal number of clusters (K) for unmixing a dataset within the specified range [Kmin, Kmax]. The method evaluates clustering performance using a residual regressor and determines the best clustering configuration based on the Bayesian Information Criterion (BIC). The solution may optionally incorporate graphical searches for pooled or per-cluster representations.
Parameters:
data - The dataset to be analyzed for clustering. Contains data points to be partitioned.
Kmin - The minimum number of clusters to evaluate. Must be at least 1.
Kmax - The maximum number of clusters to evaluate. Cannot exceed the number of rows in the dataset.
regressor - The residual regressor used for fitting the data and evaluating clustering performance.
pooledSearch - A function for building a pooled graphical representation of the dataset. Can be null if not needed.
perClusterSearch - A function for building graphical representations for individual clusters. Can be null if not applicable.
base - The base configuration object containing parameters for clustering and residual calculation.
Returns:
An UnmixResult object containing the clustering results, including the best number of clusters (K), cluster labels, per-cluster datasets, and optionally cluster-specific graphs.
Throws:
IllegalArgumentException - If Kmin is less than 1 or if Kmax exceeds the number of rows in the dataset.
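The selection described above can be sketched as a loop over K in [Kmin, Kmax] that validates the bounds and keeps the K with the best BIC. The sketch below is an assumption-laden illustration, not Tetrad's code: it uses the textbook form BIC = -2 log L + p ln n with lower values preferred (the source does not state which sign convention EmUnmix uses), counts parameters for a full-covariance mixture (also an assumption), and takes the log-likelihood for each K from a caller-supplied function. `BicSelectSketch`, `paramCount`, `bic`, and `logLikForK` are hypothetical names.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch (not part of Tetrad): choosing K by BIC within [kMin, kMax]. */
final class BicSelectSketch {

    /** Free parameters of a k-component, d-dimensional mixture, assuming full covariances. */
    static int paramCount(int k, int d) {
        int mixingWeights = k - 1;          // weights sum to 1
        int means = k * d;
        int covariances = k * d * (d + 1) / 2;
        return mixingWeights + means + covariances;
    }

    /** Textbook BIC; lower is better under this convention. */
    static double bic(double logLik, int numParams, int n) {
        return -2.0 * logLik + numParams * Math.log(n);
    }

    /** Returns the K in [kMin, kMax] with the lowest BIC; n is the row count, d the dimension. */
    static int selectK(int kMin, int kMax, int n, int d, IntToDoubleFunction logLikForK) {
        // Bounds documented for selectK: Kmin >= 1 and Kmax <= number of rows.
        if (kMin < 1) {
            throw new IllegalArgumentException("Kmin must be at least 1");
        }
        if (kMax > n) {
            throw new IllegalArgumentException("Kmax cannot exceed the number of rows");
        }
        // Extra guard added in this sketch; not mentioned in the source.
        if (kMin > kMax) {
            throw new IllegalArgumentException("Kmin must not exceed Kmax");
        }

        int bestK = kMin;
        double bestBic = Double.POSITIVE_INFINITY;
        for (int k = kMin; k <= kMax; k++) {
            double score = bic(logLikForK.applyAsDouble(k), paramCount(k, d), n);
            if (score < bestBic) {
                bestBic = score;
                bestK = k;
            }
        }
        return bestK;
    }

    public static void main(String[] args) {
        // Made-up log-likelihoods purely for demonstration.
        IntToDoubleFunction logLik = k -> k == 1 ? -500.0 : (k == 2 ? -400.0 : -398.0);
        System.out.println("selected K = " + selectK(1, 3, 100, 2, logLik));
    }
}
```

The penalty term is what keeps the loop from always choosing Kmax: a larger K raises the log-likelihood only slightly in the demo values, while the parameter count grows with every added component.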