edu.cmu.tetrad.search.utils.AdTree

public class AdTree extends Object

An AD tree is a data structure used to store the data for a given dataset in a way that makes it easy to calculate cell counts for a multidimensional contingency table for a given set of variables.

It is a tree of cells, where each cell is a list of indices into the rows of the data set.

Each child cell is a subset of the parent cell for a particular value of a new variable in the list.

The list of variables is given in the method buildTable. The tree starts with a node representing the list of all rows in the data (or all rows given in the constructor), and then subdivides the data by each variable in turn. The leaves of the tree constitute the final subdivision of the data into cells, and the sizes of these cells are the counts for the multidimensional contingency table for the given list of variables.

Continuous variables are ignored for this data structure.

This is an adaptation of the AD tree as described in: Anderson, B., & Moore, A. W. (1998). AD-trees for fast counting and rule learning. In KDD98 Conference.

We add some optimizations to this class to help speed it up. First, we add a cache to store subdivisions for reuse. This cache uses a List<Node> as the key, which is the list of variables in the table. We limit the cache size to avoid using too much memory. We also add a maximum cache size parameter to make it easier to change the cache size.

Second, we add a depth limit to the cache. This is useful for cases where the cache is not being used effectively. We set the default depth limit to 5, which means that the cache will only store a tree of subdivision results out to a depth of 5. This is useful because long keys to the cache are less likely to be reused, so we only store initial segments of each branch.

Third, we internally sort the variables in the table to match the order of the nodesHash. This is useful because fewer branches need to be created in the tree. For instance, building a tree with variables [A, B, C] and [C, B, A] will result in the same marginal tables, just permuted, so we sort the variables to avoid this duplication. The inverse mapping is used to map the sorted indices back to the original order for the public methods.

Author:: josephramsey 2024-9-1

Constructor Summary

Constructors

Constructor

Description

AdTree(DataSet dataSet)

Constructs an AD Leaf Tree for the given dataset, without subsampling.

AdTree(DataSet dataSet, List<Integer> rows)

Constructs an AD Leaf Tree for the given dataset.
Method Summary

Modifier and Type

Method

Description

void

buildTable(List<DiscreteVariable> variables)

Builds the contingency table based on the given list of discrete variables.

List<Integer>

getCell(int cellIndex)

Returns the cell in the table for the given index.

int

getCellIndex(int... coords)

Returns the index of the cell in the table for the given coordinates.

int

getCount(int cellIndex)

Returns the count of the cell in the table for the given index.

int

getCount(int[] coords)

Returns the count of the cell in the table for the given coordinates.

int

getNumCells()

Return the number of cells in the table.

void

setCacheDepthLimit(int cacheDepthLimit)

Sets the maximum depth of the cache.

void

setMaxCacheSize(int maxCacheSize)

Sets the maximum size of the cache.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- AdTree
  
  public AdTree(DataSet dataSet)
  
  Constructs an AD Leaf Tree for the given dataset, without subsampling.
  
  Parameters:
  
  dataSet - A discrete dataset.
- AdTree
  
  public AdTree(DataSet dataSet, List<Integer> rows)
  
  Constructs an AD Leaf Tree for the given dataset.
  
  Parameters:
  
  dataSet - A discrete dataset.
  
  rows - The rows of the dataset to use; the default is to use all the rows.
  This is useful for subsampling.
Method Details
- buildTable
  
  public void buildTable(List<DiscreteVariable> variables)
  
  Builds the contingency table based on the given list of discrete variables.
  
  Parameters:
  
  variables - A list of discrete variables.
- getNumCells
  
  public int getNumCells()
  
  Return the number of cells in the table.
  
  Returns:
  
  the number of cells in the table.
- getCellIndex
  
  public int getCellIndex(int... coords)
  
  Returns the index of the cell in the table for the given coordinates. It is assumed that the given coordinates are within the bounds of the table--that is, that the first coordinate is between 0 and the dimension of the first of the first variable, the second coordinate is between 0 and the dimension of the second variable, and so on. The coordinates are 0-based. It is also assumed that the number of coordinates is equal to the number of variables in the table.
  
  Parameters:
  
  coords - the coordinates of the cell.
  
  Returns:
  
  the index of the cell in the table.
  
  Throws:
  
  IllegalArgumentException - if the coordinates are null, if there are too many coordinates, if a coordinate is out of bounds, or if the number of coordinates is not equal to the number of variables in the table.
- getCell
  public List<Integer> getCell(int cellIndex)
  
  Returns the cell in the table for the given index.
  This is a list of indices into the rows of the data set.
  
  Parameters:
  
  cellIndex - the index of the cell.
  
  Returns:
  
  the cell in the table.
  
  See Also:
  
  getCellIndex(int[])
- getCount
  
  public int getCount(int[] coords)
  
  Returns the count of the cell in the table for the given coordinates.
  
  Parameters:
  
  coords - the coordinates of the cell.
  
  Returns:
  
  the cell in the table.
- getCount
  
  public int getCount(int cellIndex)
  
  Returns the count of the cell in the table for the given index.
  
  Parameters:
  
  cellIndex - the index of the cell.
  
  Returns:
  
  the cell in the table.
- setMaxCacheSize
  
  public void setMaxCacheSize(int maxCacheSize)
  
  Sets the maximum size of the cache. This is useful for cases where the cache is not being used effectively.
  
  Parameters:
  
  maxCacheSize - The maximum size of the cache. Must be >= 1. Default is 1000.
- setCacheDepthLimit
  
  public void setCacheDepthLimit(int cacheDepthLimit)
  
  Sets the maximum depth of the cache. This is useful for cases where the cache is not being used effectively.
  
  Parameters:
  
  cacheDepthLimit - The maximum depth of the cache. Must be >= 0. Default it 5.

Class AdTree

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

AdTree

AdTree

Method Details

buildTable

getNumCells

getCellIndex

getCell

getCount

getCount

setMaxCacheSize

setCacheDepthLimit