edu.cmu.tetrad.search.unmix.EmUnmix

public final class EmUnmix extends Object

EmUnmix fits Gaussian mixtures to residual signatures derived from a dataset, using the Expectation-Maximization (EM) algorithm. It supports fitting a single model with a given cluster count (K) and selecting K over a range using criteria such as the Bayesian Information Criterion (BIC). The input to the mixture fit is built by residual regression on the dataset. Optional graph search functions can be applied to the pooled data and to each cluster. The EM algorithm and the data preprocessing are configured through EmUnmix.Config.
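
To make the EM step concrete, here is a minimal sketch of fitting a Gaussian mixture to one-dimensional values. It is not Tetrad's implementation: the class `Em1dSketch`, its `Fit` holder, the quantile-based initialization, and the fixed iteration count are all assumptions chosen for illustration, and the real class works on multivariate residual signatures.

```java
import java.util.Arrays;

/** Illustrative 1-D Gaussian mixture EM; a sketch, not Tetrad's implementation. */
final class Em1dSketch {

    /** Fitted mixture parameters plus a hard label (argmax responsibility) per point. */
    static final class Fit {
        final double[] weights;
        final double[] means;
        final double[] variances;
        final int[] labels;

        Fit(double[] weights, double[] means, double[] variances, int[] labels) {
            this.weights = weights;
            this.means = means;
            this.variances = variances;
            this.labels = labels;
        }
    }

    static Fit fit(double[] x, int k, int iterations) {
        int n = x.length;
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double mean = Arrays.stream(x).average().orElse(0.0);
        double pooledVar = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).sum() / n;

        double[] w = new double[k];
        double[] mu = new double[k];
        double[] var = new double[k];
        for (int j = 0; j < k; j++) {
            w[j] = 1.0 / k;
            mu[j] = sorted[(int) ((j + 0.5) * n / k)]; // deterministic start at quantiles
            var[j] = Math.max(pooledVar, 1e-6);
        }

        double[][] r = new double[n][k];
        for (int it = 0; it < iterations; it++) {
            eStep(x, w, mu, var, r);
            // M-step: re-estimate weight, mean, and variance of each component.
            for (int j = 0; j < k; j++) {
                double nj = 0.0;
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    nj += r[i][j];
                    sum += r[i][j] * x[i];
                }
                if (nj < 1e-12) continue; // leave an empty component unchanged
                mu[j] = sum / nj;
                double ss = 0.0;
                for (int i = 0; i < n; i++) {
                    double d = x[i] - mu[j];
                    ss += r[i][j] * d * d;
                }
                var[j] = Math.max(ss / nj, 1e-6); // variance floor avoids collapse
                w[j] = nj / n;
            }
        }

        eStep(x, w, mu, var, r); // responsibilities under the final parameters
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = 1; j < k; j++) {
                if (r[i][j] > r[i][labels[i]]) labels[i] = j;
            }
        }
        return new Fit(w, mu, var, labels);
    }

    /** E-step: r[i][j] becomes the posterior probability that point i came from component j. */
    private static void eStep(double[] x, double[] w, double[] mu, double[] var, double[][] r) {
        for (int i = 0; i < x.length; i++) {
            double total = 0.0;
            for (int j = 0; j < w.length; j++) {
                double d = x[i] - mu[j];
                r[i][j] = w[j] * Math.exp(-0.5 * d * d / var[j]) / Math.sqrt(2 * Math.PI * var[j]);
                total += r[i][j];
            }
            if (total == 0.0) total = Double.MIN_VALUE; // guard against underflow
            for (int j = 0; j < w.length; j++) r[i][j] /= total;
        }
    }
}
```

Calling `Em1dSketch.fit(values, 2, 50)` on values drawn from two well-separated groups should place one mean near each group and give the two groups different labels.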

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static final class

EmUnmix.Config

Configuration class for the EmUnmix algorithm, providing parameters and settings to control the behavior of the unmixing process.
Constructor Summary

Constructors

Constructor

Description

EmUnmix()

Default constructor for EmUnmix.
Method Summary

Modifier and Type

Method

Description

static UnmixResult

run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor)

Runs the unmixing process on the provided dataset using the specified configuration and regressor.

static UnmixResult

run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor, Function<DataSet,Graph> pooledSearch, Function<DataSet,Graph> perClusterSearch)

Executes the unmixing process on the given dataset using the specified configuration, residual regressor, and optional graph search functions.

static UnmixResult

selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, EmUnmix.Config base)

Selects the optimal number of clusters (K) for unmixing the provided dataset within the specified range.

static UnmixResult

selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, Function<DataSet,Graph> pooledSearch, Function<DataSet,Graph> perClusterSearch, EmUnmix.Config base)

Selects the optimal number of clusters (K) for unmixing a dataset within the specified range [Kmin, Kmax].

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- EmUnmix
  
  public EmUnmix()
  
  Default constructor for EmUnmix.
Method Details
- run
  
  public static UnmixResult run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor)
  
  Runs the unmixing process on the provided dataset using the specified configuration and regressor.
  
  Parameters:
  
  data - The dataset on which the unmixing process is performed. Contains data points to be clustered.
  
  cfg - The configuration object providing parameters and settings for the unmixing algorithm.
  
  regressor - The residual regressor used for generating the residual signatures that are clustered.
  
  Returns:
  
  An instance of UnmixResult containing the results of the unmixing process, including cluster labels, per-cluster datasets, and optional cluster graphs.
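
The relationship between cluster labels and per-cluster datasets can be sketched as a simple partition of row indices. This is only an illustration under assumptions: the class `LabelPartitionSketch` is hypothetical, and it assumes labels are 0-based integers in the range [0, K), which the documentation above does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: groups row indices by cluster label (assumed 0-based). */
final class LabelPartitionSketch {

    /** Returns one list of row indices per cluster; labels must lie in [0, k). */
    static List<List<Integer>> rowsByCluster(int[] labels, int k) {
        List<List<Integer>> rows = new ArrayList<>();
        for (int j = 0; j < k; j++) {
            rows.add(new ArrayList<>());
        }
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] < 0 || labels[i] >= k) {
                throw new IllegalArgumentException("label out of range at row " + i);
            }
            rows.get(labels[i]).add(i);
        }
        return rows;
    }
}
```

Each inner list could then be used to select the rows that form one per-cluster dataset.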
- selectK
  
  public static UnmixResult selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, EmUnmix.Config base)
  
  Selects the optimal number of clusters (K) for unmixing the provided dataset within the specified range. The method uses the specified residual regressor and configuration settings to determine the best cluster count.
  
  Parameters:
  
  data - The dataset to analyze for optimal clustering. Contains data points to be partitioned.
  
  Kmin - The minimum number of clusters to evaluate.
  
  Kmax - The maximum number of clusters to evaluate.
  
  regressor - The residual regressor used to fit the data and evaluate clustering performance.
  
  base - The base configuration used for running the clustering algorithm during evaluation.
  
  Returns:
  
  An UnmixResult object containing the best clustering results, cluster labels, per-cluster datasets, and additional optional cluster information.
- run
  
  public static UnmixResult run(DataSet data, EmUnmix.Config cfg, ResidualRegressor regressor, Function<DataSet,Graph> pooledSearch, Function<DataSet,Graph> perClusterSearch)
  
  Executes the unmixing process on the given dataset using the specified configuration, residual regressor, and optional graph search functions. The method applies residual signature extraction, EM clustering, and optionally searches for cluster-specific graphical representations.
  
  Parameters:
  
  data - The dataset to be processed, containing data points to be partitioned into clusters.
  
  cfg - The configuration object that provides parameters and settings for the unmixing algorithm.
  
  regressor - The residual regressor used for generating residual signatures and handling dependencies in the data.
  
  pooledSearch - A function that builds a pooled graphical representation of the dataset. May be null if not needed.
  
  perClusterSearch - A function that builds graphical representations for individual clusters. May be null if not applicable.
  
  Returns:
  
  An instance of UnmixResult containing the clustering results, including cluster labels, per-cluster datasets, and optionally cluster-specific graphs.
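
The nullable search parameters follow a common pattern: apply the function when it is supplied and skip the step otherwise. The sketch below shows that pattern with generic stand-in types. The class `OptionalSearchSketch` is hypothetical, and returning null when no search is given is an assumption; how UnmixResult actually represents absent graphs is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative only: applies an optional per-cluster search function. */
final class OptionalSearchSketch {

    /**
     * Applies search to each cluster's data in order.
     * Returns null when search is null (an assumption of this sketch).
     */
    static <D, G> List<G> searchPerCluster(List<D> clusterData, Function<D, G> search) {
        if (search == null) {
            return null; // the caller asked for clustering only
        }
        List<G> graphs = new ArrayList<>();
        for (D data : clusterData) {
            graphs.add(search.apply(data));
        }
        return graphs;
    }
}
```

With the real types, D would correspond to DataSet and G to Graph, matching the Function<DataSet,Graph> parameters in the signature above.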
- selectK
  
  public static UnmixResult selectK(DataSet data, int Kmin, int Kmax, ResidualRegressor regressor, Function<DataSet,Graph> pooledSearch, Function<DataSet,Graph> perClusterSearch, EmUnmix.Config base)
  
  Selects the optimal number of clusters (K) for unmixing a dataset within the specified range [Kmin, Kmax]. The method evaluates clustering performance using a residual regressor and determines the best clustering configuration based on the Bayesian Information Criterion (BIC). The solution may optionally incorporate graphical searches for pooled or per-cluster representations.
  
  Parameters:
  
  data - The dataset to be analyzed for clustering. Contains data points to be partitioned.
  
  Kmin - The minimum number of clusters to evaluate. Must be at least 1.
  
  Kmax - The maximum number of clusters to evaluate. Cannot exceed the number of rows in the dataset.
  
  regressor - The residual regressor used for fitting the data and evaluating clustering performance.
  
  pooledSearch - A function for building a pooled graphical representation of the dataset. Can be null if not needed.
  
  perClusterSearch - A function for building graphical representations for individual clusters. Can be null if not applicable.
  
  base - The base configuration object containing parameters for clustering and residual calculation.
  
  Returns:
  
  An UnmixResult object containing the clustering results, including the best number of clusters (K), cluster labels, per-cluster datasets, and optionally cluster-specific graphs.
  
  Throws:
  
  IllegalArgumentException - If Kmin is less than 1 or if Kmax exceeds the number of rows in the dataset.
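
The selection loop can be sketched as follows, under stated assumptions. The class `BicSelectSketch`, the full-covariance parameter count, and the "lower is better" convention BIC = -2 ln L + p ln n are choices made for this illustration; Tetrad's own scoring convention and parameter count may differ. The two range checks mirror the documented IllegalArgumentException conditions, and the check for Kmin > Kmax is an extra that the documentation does not mention.

```java
import java.util.function.IntToDoubleFunction;

/** Illustrative BIC-based choice of K; a sketch, not Tetrad's implementation. */
final class BicSelectSketch {

    /** Free parameters of a K-component, d-dimensional, full-covariance Gaussian mixture. */
    static int freeParams(int k, int d) {
        // (k - 1) mixing weights + k mean vectors + k symmetric covariance matrices
        return (k - 1) + k * d + k * d * (d + 1) / 2;
    }

    /** BIC in the "lower is better" convention: -2 ln L + p ln n. */
    static double bic(double logLik, int params, int n) {
        return -2.0 * logLik + params * Math.log(n);
    }

    /**
     * Returns the K in [kMin, kMax] with the smallest BIC.
     * logLikForK is a hypothetical callback giving the fitted log-likelihood for each K.
     */
    static int selectK(int kMin, int kMax, int n, int d, IntToDoubleFunction logLikForK) {
        if (kMin < 1) {
            throw new IllegalArgumentException("Kmin must be at least 1");
        }
        if (kMax > n) {
            throw new IllegalArgumentException("Kmax cannot exceed the number of rows");
        }
        if (kMin > kMax) { // extra check, not documented in the source
            throw new IllegalArgumentException("Kmin must not exceed Kmax");
        }
        int bestK = kMin;
        double bestBic = Double.POSITIVE_INFINITY;
        for (int k = kMin; k <= kMax; k++) {
            double score = bic(logLikForK.applyAsDouble(k), freeParams(k, d), n);
            if (score < bestBic) {
                bestBic = score;
                bestK = k;
            }
        }
        return bestK;
    }
}
```

The penalty term p ln n grows with K, so a larger K is chosen only when it improves the log-likelihood by more than the added parameters cost.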
