Class AdTree

java.lang.Object
edu.cmu.tetrad.search.utils.AdTree

public class AdTree extends Object
An AD tree is a data structure used to store the data for a given dataset in a way that makes it easy to calculate cell counts for a multidimensional contingency table for a given set of variables.

It is a tree of cells, where each cell is a list of indices into the rows of the data set.

Each child cell is a subset of the parent cell for a particular value of a new variable in the list.

The list of variables is given in the method buildTable. The tree starts with a node representing the list of all rows in the data (or all rows given in the constructor), and then subdivides the data by each variable in turn. The leaves of the tree constitute the final subdivision of the data into cells, and the sizes of these cells are the counts for the multidimensional contingency table for the given list of variables.

Continuous variables are ignored for this data structure.

This is an adaptation of the AD tree as described in: Anderson, B., & Moore, A. W. (1998). AD-trees for fast counting and rule learning. In KDD98 Conference.

We add some optimizations to this class to help speed it up. First, we add a cache to store subdivisions for reuse. This cache uses a List<Node> as the key, which is the list of variables in the table. We limit the cache size to avoid using too much memory. We also add a maximum cache size parameter to make it easier to change the cache size.

Second, we add a depth limit to the cache. This is useful for cases where the cache is not being used effectively. We set the default depth limit to 5, which means that the cache will only store a tree of subdivision results out to a depth of 5. This is useful because long keys to the cache are less likely to be reused, so we only store initial segments of each branch.

Third, we internally sort the variables in the table to match the order of the nodesHash. This is useful because fewer branches need to be created in the tree. For instance, building a tree with variables [A, B, C] and [C, B, A] will result in the same marginal tables, just permuted, so we sort the variables to avoid this duplication. The inverse mapping is used to map the sorted indices back to the original order for the public methods.

Author:
josephramsey 2024-9-1