Class AdTree
It is a tree of cells, where each cell is a list of indices into the rows of the data set.
Each child cell is a subset of the parent cell for a particular value of a new variable in the list.
The list of variables is given in the method buildTable
. The tree starts with a node representing the
list of all rows in the data (or all rows given in the constructor), and then subdivides the data by each variable in
turn. The leaves of the tree constitute the final subdivision of the data into cells, and the sizes of these cells
are the counts for the multidimensional contingency table for the given list of variables.
Continuous variables are ignored for this data structure.
This is an adaptation of the AD tree as described in: Anderson, B., & Moore, A. W. (1998). AD-trees for fast counting and rule learning. In KDD98 Conference.
We add some optimizations to this class to help speed it up. First, we add a cache to store subdivisions for reuse. This cache uses a List<Node> as the key, which is the list of variables in the table. We limit the cache size to avoid using too much memory. We also add a maximum cache size parameter to make it easier to change the cache size.
Second, we add a depth limit to the cache. This is useful for cases where the cache is not being used effectively. We set the default depth limit to 5, which means that the cache will only store a tree of subdivision results out to a depth of 5. This is useful because long keys to the cache are less likely to be reused, so we only store initial segments of each branch.
Third, we internally sort the variables in the table to match the order of the nodesHash. This is useful because fewer branches need to be created in the tree. For instance, building a tree with variables [A, B, C] and [C, B, A] will result in the same marginal tables, just permuted, so we sort the variables to avoid this duplication. The inverse mapping is used to map the sorted indices back to the original order for the public methods.
- Author:
- josephramsey 2024-9-1
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionvoid
buildTable
(List<DiscreteVariable> variables) Builds the contingency table based on the given list of discrete variables.getCell
(int cellIndex) Returns the cell in the table for the given index.int
getCellIndex
(int... coords) Returns the index of the cell in the table for the given coordinates.int
getCount
(int cellIndex) Returns the count of the cell in the table for the given index.int
getCount
(int[] coords) Returns the count of the cell in the table for the given coordinates.int
Return the number of cells in the table.void
setCacheDepthLimit
(int cacheDepthLimit) Sets the maximum depth of the cache.void
setMaxCacheSize
(int maxCacheSize) Sets the maximum size of the cache.
-
Constructor Details
-
AdTree
Constructs an AD Leaf Tree for the given dataset, without subsampling.- Parameters:
dataSet
- A discrete dataset.
-
AdTree
-
-
Method Details
-
buildTable
Builds the contingency table based on the given list of discrete variables.- Parameters:
variables
- A list of discrete variables.
-
getNumCells
public int getNumCells()Return the number of cells in the table.- Returns:
- the number of cells in the table.
-
getCellIndex
public int getCellIndex(int... coords) Returns the index of the cell in the table for the given coordinates. It is assumed that the given coordinates are within the bounds of the table--that is, that the first coordinate is between 0 and the dimension of the first of the first variable, the second coordinate is between 0 and the dimension of the second variable, and so on. The coordinates are 0-based. It is also assumed that the number of coordinates is equal to the number of variables in the table.- Parameters:
coords
- the coordinates of the cell.- Returns:
- the index of the cell in the table.
- Throws:
IllegalArgumentException
- if the coordinates are null, if there are too many coordinates, if a coordinate is out of bounds, or if the number of coordinates is not equal to the number of variables in the table.
-
getCell
-
getCount
public int getCount(int[] coords) Returns the count of the cell in the table for the given coordinates.- Parameters:
coords
- the coordinates of the cell.- Returns:
- the cell in the table.
-
getCount
public int getCount(int cellIndex) Returns the count of the cell in the table for the given index.- Parameters:
cellIndex
- the index of the cell.- Returns:
- the cell in the table.
-
setMaxCacheSize
public void setMaxCacheSize(int maxCacheSize) Sets the maximum size of the cache. This is useful for cases where the cache is not being used effectively.- Parameters:
maxCacheSize
- The maximum size of the cache. Must be >= 1. Default is 1000.
-
setCacheDepthLimit
public void setCacheDepthLimit(int cacheDepthLimit) Sets the maximum depth of the cache. This is useful for cases where the cache is not being used effectively.- Parameters:
cacheDepthLimit
- The maximum depth of the cache. Must be >= 0. Default it 5.
-