Class BoxDataSet

java.lang.Object
edu.cmu.tetrad.data.BoxDataSet
All Implemented Interfaces:
DataModel, DataSet, KnowledgeTransferable, VariableSource, TetradSerializable, Serializable

public final class BoxDataSet extends Object implements DataSet
Wraps a DataBox in such a way that mixed data sets can be stored. The type of each column must be specified by a Variable object, which must be either a ContinuousVariable or a DiscreteVariable. This class violates object orientation in that the underlying data matrix is retrievable using the getDoubleData() method. This is allowed so that external calculations may be performed on large datasets without having to allocate extra memory. If this matrix needs to be modified externally, please consider making a copy of it first, using the TetradMatrix copy() method.

The data set may be given a name; this name is not used internally.

The data set has a list of variables associated with it, as described above. This list is coordinated with the stored data, in that data for the i'th variable will be in the i'th column.

A subset of variables in the data set may be designated as selected. This selection set is stored with the data set and may be manipulated using the select and deselect methods.

A multiplicity m_i may be associated with each case c_i in the dataset, meaning that c_i occurs m_i times in the dataset.
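
As an illustration of the multiplicity convention, the following minimal sketch expands a list of cases by their multiplicities in plain Java. It does not use any Tetrad classes: MultiplicitySketch and expand are hypothetical names, and the array-based representation is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tetrad: expands each case c_i into m_i copies.
final class MultiplicitySketch {

    static List<double[]> expand(double[][] cases, int[] multiplicities) {
        if (cases.length != multiplicities.length) {
            throw new IllegalArgumentException("One multiplicity is required per case.");
        }
        List<double[]> expanded = new ArrayList<>();
        for (int i = 0; i < cases.length; i++) {
            // Case i is repeated multiplicities[i] times in the expanded list.
            for (int k = 0; k < multiplicities[i]; k++) {
                expanded.add(cases[i].clone());
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        double[][] cases = {{1.0, 2.0}, {3.0, 4.0}};
        int[] multiplicities = {2, 3};
        // Two stored cases expand to 2 + 3 rows.
        System.out.println(expand(cases, multiplicities).size());
    }
}
```

Under this reading, a dataset that stores two cases with multiplicities 2 and 3 represents five observations.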

Knowledge may be associated with the data set, using the setKnowledge method. This knowledge is not used internally to the data set, but it may be retrieved by algorithms and used.

Author:
josephramsey
  • Constructor Details

    • BoxDataSet

      public BoxDataSet(DataBox dataBox, List<Node> variables)
      Constructs a new data set backed by the given data box, with the columns described by the given list of variables.
      Parameters:
      dataBox - The data box.
      variables - The variables.
    • BoxDataSet

      public BoxDataSet(BoxDataSet dataSet)
      Makes a copy of the given data set.
      Parameters:
      dataSet - The data set to copy.
  • Method Details

    • serializableInstance

      public static BoxDataSet serializableInstance()
      Generates a simple exemplar of this class to test serialization.
      Returns:
      A simple exemplar of this class.
    • getColumnToTooltip

      public Map<String,String> getColumnToTooltip()

      Getter for the field columnToTooltip.

      Specified by:
      getColumnToTooltip in interface DataSet
      Returns:
      a Map object
    • getName

      public String getName()
      Gets the name of the data set.
      Specified by:
      getName in interface DataModel
      Specified by:
      getName in interface DataSet
      Returns:
      the name of the data set.
    • setName

      public void setName(String name)
      Sets the name of the data set (may be null).

      Specified by:
      setName in interface DataModel
      Parameters:
      name - the name to set
    • getNumColumns

      public int getNumColumns()

      Specified by:
      getNumColumns in interface DataSet
      Returns:
      the number of variables in the data set.
    • getNumRows

      public int getNumRows()

      Specified by:
      getNumRows in interface DataSet
      Returns:
      the number of rows in the rectangular data set, which is the maximum of the number of rows in the list of wrapped columns.
    • setInt

      public void setInt(int row, int column, int value)
      Sets the value at the given (row, column) to the given int value, assuming the variable for the column is discrete.

      Specified by:
      setInt in interface DataSet
      Parameters:
      row - The index of the case.
      column - The index of the variable.
      value - The value to set.
    • setDouble

      public void setDouble(int row, int column, double value)
      Sets the value at the given (row, column) to the given double value, assuming the variable for the column is continuous.

      Specified by:
      setDouble in interface DataSet
      Parameters:
      row - The index of the case.
      column - The index of the variable.
      value - The value to set.
    • getObject

      public Object getObject(int row, int col)

      Specified by:
      getObject in interface DataSet
      Parameters:
      row - The index of the case.
      col - The index of the variable.
      Returns:
      the value at the given row and column as an Object. The type returned is deliberately vague, allowing for variables of any type. Primitives will be returned as corresponding wrapping objects (for example, doubles as Doubles).
    • setObject

      public void setObject(int row, int col, Object value)
      Sets the value at the given (row, column) to the given value.
      Specified by:
      setObject in interface DataSet
      Parameters:
      row - The index of the case.
      col - The index of the variable.
      value - The value to set.
    • getSelectedIndices

      public int[] getSelectedIndices()

      Specified by:
      getSelectedIndices in interface DataSet
      Returns:
      the indices of the currently selected variables.
    • addVariable

      public void addVariable(Node variable)
      Adds the given variable to the data set as a new last column, increasing the number of columns by one. The new column is filled with missing values.

      Specified by:
      addVariable in interface DataSet
      Parameters:
      variable - The variable to add.
    • addVariable

      public void addVariable(int index, Node variable)
      Adds the given variable to the data set at the given index, increasing the number of columns by one, moving each column i >= index to column i + 1, and inserting a column of missing values at column index.

      Specified by:
      addVariable in interface DataSet
      Parameters:
      index - The index at which to add the variable.
      variable - The variable to add.
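
      The column shift described for addVariable(int index, Node variable) can be pictured with a small plain-Java sketch. It does not use Tetrad classes: InsertColumnSketch and insertMissingColumn are hypothetical names, and using Double.NaN as the missing-value marker is an assumption made for the example.

```java
// Hypothetical sketch, not Tetrad code: inserts a column of missing values.
final class InsertColumnSketch {

    // Assumption: Double.NaN stands in for a missing continuous value.
    static double[][] insertMissingColumn(double[][] data, int index) {
        double[][] result = new double[data.length][];
        for (int r = 0; r < data.length; r++) {
            int oldCols = data[r].length;
            if (index < 0 || index > oldCols) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            double[] row = new double[oldCols + 1];
            // Columns before the insertion point keep their positions.
            System.arraycopy(data[r], 0, row, 0, index);
            // The new column holds a missing value.
            row[index] = Double.NaN;
            // Columns i >= index move to i + 1.
            System.arraycopy(data[r], index, row, index + 1, oldCols - index);
            result[r] = row;
        }
        return result;
    }
}
```

      Inserting at index 1 into a two-column table leaves column 0 in place, puts the missing values in column 1, and moves the old column 1 to column 2.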
    • getVariable

      public Node getVariable(int col)

      Specified by:
      getVariable in interface DataSet
      Parameters:
      col - The index of the variable.
      Returns:
      the variable at the given column.
    • getColumn

      public int getColumn(Node variable)

      Specified by:
      getColumn in interface DataSet
      Parameters:
      variable - The variable to check.
      Returns:
      the column index of the given variable.
    • changeVariable

      public void changeVariable(Node from, Node to)
      Replaces the variable 'from' with the variable 'to' in its column. Currently supported only for discrete variables.

      Specified by:
      changeVariable in interface DataSet
      Parameters:
      from - The variable to change.
      to - The variable to change to.
    • getVariable

      public Node getVariable(String varName)

      Specified by:
      getVariable in interface DataModel
      Specified by:
      getVariable in interface DataSet
      Parameters:
      varName - The name of the variable to look up.
      Returns:
      the variable with the given name, or null if no such variable exists.
    • getVariables

      public List<Node> getVariables()

      Getter for the field variables.

      Specified by:
      getVariables in interface DataSet
      Specified by:
      getVariables in interface VariableSource
      Returns:
      (a copy of) the List of Variables for the data set, in the order of their columns.
    • getKnowledge

      public Knowledge getKnowledge()

      Getter for the field knowledge.

      Specified by:
      getKnowledge in interface KnowledgeTransferable
      Returns:
      a copy of the knowledge associated with this data set. (Cannot be null.)
    • setKnowledge

      public void setKnowledge(Knowledge knowledge)
      Sets the knowledge associated with this data set to a copy of the given object. May not be null.

      Specified by:
      setKnowledge in interface KnowledgeTransferable
      Parameters:
      knowledge - the knowledge to set
    • getVariableNames

      public List<String> getVariableNames()

      Specified by:
      getVariableNames in interface DataSet
      Specified by:
      getVariableNames in interface VariableSource
      Returns:
      (a copy of) the list of variable names for the data set, in the order of their columns.
    • setSelected

      public void setSelected(Node variable, boolean selected)
      Marks the given column as selected if 'selected' is true or deselected if 'selected' is false.

      Specified by:
      setSelected in interface DataSet
      Parameters:
      variable - The variable to select or deselect.
      selected - True to select the variable, false to deselect it.
    • clearSelection

      public void clearSelection()
      Marks all variables as deselected.
      Specified by:
      clearSelection in interface DataSet
    • ensureRows

      public void ensureRows(int rows)
      Ensures that the dataset has at least the given number of rows, adding rows if necessary. The new rows are filled with missing values. Used for pasting data into the dataset.

      Specified by:
      ensureRows in interface DataSet
      Parameters:
      rows - The number of rows to ensure.
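
      A minimal plain-Java sketch of the padding behaviour described for ensureRows follows. It is not the Tetrad implementation: EnsureRowsSketch and its ensureRows method are hypothetical, and Double.NaN is assumed as the missing-value marker.

```java
import java.util.Arrays;

// Hypothetical sketch, not Tetrad code: pads a table with rows of missing values.
final class EnsureRowsSketch {

    // Assumption: Double.NaN stands in for a missing value.
    static double[][] ensureRows(double[][] data, int rows, int numColumns) {
        if (data.length >= rows) {
            return data; // Already large enough; nothing to add.
        }
        // Existing rows are kept; the new slots are filled below.
        double[][] result = Arrays.copyOf(data, rows);
        for (int r = data.length; r < rows; r++) {
            result[r] = new double[numColumns];
            Arrays.fill(result[r], Double.NaN);
        }
        return result;
    }
}
```

      The call never shrinks the table: asking for fewer rows than are present returns the data unchanged.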
    • ensureColumns

      public void ensureColumns(int columns, List<String> excludedVariableNames)
      Ensures that the dataset has at least the given number of columns, adding continuous variables with unique names until that is true. The new columns are filled with missing values. Used for pasting data into the dataset. Names in the excludedVariableNames list are not used for new columns, so that the calling class can assign those names later without conflicts.

      Specified by:
      ensureColumns in interface DataSet
      Parameters:
      columns - The number of columns to ensure.
      excludedVariableNames - The names of variables that should not be used for new columns.
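
      The role of excludedVariableNames can be sketched in plain Java as below. This is an illustration only: EnsureColumnsSketch is a hypothetical class, and the "X1, X2, ..." naming scheme is an assumption, since the source does not say how new names are generated.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Tetrad code: picks unique names for new columns.
final class EnsureColumnsSketch {

    static List<String> ensureColumns(List<String> existingNames, int columns,
                                      List<String> excludedVariableNames) {
        List<String> names = new ArrayList<>(existingNames);
        // A new name may clash neither with an existing column nor with an excluded name.
        Set<String> unavailable = new HashSet<>(existingNames);
        unavailable.addAll(excludedVariableNames);
        int counter = 0;
        while (names.size() < columns) {
            // Assumed naming scheme: X1, X2, X3, ...
            String candidate = "X" + (++counter);
            if (unavailable.add(candidate)) {
                names.add(candidate);
            }
        }
        return names;
    }
}
```

      With existing columns X1 and Y, a target of four columns, and X2 excluded, this sketch skips X1 and X2 and adds X3 and X4.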
    • existsMissingValue

      public boolean existsMissingValue()

      Specified by:
      existsMissingValue in interface DataSet
      Returns:
      true if and only if this data set contains at least one missing value.
    • isSelected

      public boolean isSelected(Node variable)

      Specified by:
      isSelected in interface DataSet
      Parameters:
      variable - The variable to check.
      Returns:
      true iff the given column has been marked as selected.
    • removeColumn

      public void removeColumn(int index)
      Removes the variable at the given index, along with its column of data, reducing the number of columns by one.

      Specified by:
      removeColumn in interface DataSet
      Parameters:
      index - The index of the variable to remove.
    • removeColumn

      public void removeColumn(Node variable)
      Removes the given variable, along with its column of data, reducing the number of columns by one.

      Specified by:
      removeColumn in interface DataSet
      Parameters:
      variable - The variable to remove.
    • subsetColumns

      public DataSet subsetColumns(List<Node> vars)
      Creates and returns a dataset consisting of those variables in the list vars. Vars must be a subset of the variables of this DataSet. The ordering of the elements of vars will be the same as in the list of variables in this DataSet.

      Specified by:
      subsetColumns in interface DataSet
      Parameters:
      vars - The variables to include in the new data set.
      Returns:
      a new data set consisting of the variables in the list vars.
    • isContinuous

      public boolean isContinuous()

      Specified by:
      isContinuous in interface DataModel
      Specified by:
      isContinuous in interface DataSet
      Returns:
      true iff this is a continuous data set--that is, if every column in it is continuous. (By implication, empty datasets are both discrete and continuous.)
    • isDiscrete

      public boolean isDiscrete()

      Specified by:
      isDiscrete in interface DataModel
      Specified by:
      isDiscrete in interface DataSet
      Returns:
      true iff this is a discrete data set--that is, if every column in it is discrete. (By implication, empty datasets are both discrete and continuous.)
    • isMixed

      public boolean isMixed()

      Specified by:
      isMixed in interface DataModel
      Specified by:
      isMixed in interface DataSet
      Returns:
      true if this is a mixed data set--that is, if it contains at least one continuous column and one discrete column.
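
      The three type predicates (isContinuous, isDiscrete, isMixed) can be summarised by the following plain-Java sketch. It only mirrors the definitions given in the Returns clauses: ColumnTypeSketch and its ColumnType enum are hypothetical and are not Tetrad types.

```java
import java.util.List;

// Hypothetical sketch, not Tetrad code: classifies a data set by its column types.
final class ColumnTypeSketch {

    enum ColumnType { CONTINUOUS, DISCRETE }

    // Every column is continuous; vacuously true for an empty list.
    static boolean isContinuous(List<ColumnType> columns) {
        return columns.stream().allMatch(t -> t == ColumnType.CONTINUOUS);
    }

    // Every column is discrete; vacuously true for an empty list.
    static boolean isDiscrete(List<ColumnType> columns) {
        return columns.stream().allMatch(t -> t == ColumnType.DISCRETE);
    }

    // At least one continuous column and at least one discrete column.
    static boolean isMixed(List<ColumnType> columns) {
        return columns.contains(ColumnType.CONTINUOUS)
                && columns.contains(ColumnType.DISCRETE);
    }
}
```

      Because the first two predicates quantify over all columns, an empty list satisfies both of them, while it can never be mixed.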
    • getCorrelationMatrix

      public Matrix getCorrelationMatrix()

      Specified by:
      getCorrelationMatrix in interface DataSet
      Returns:
      the correlation matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library--that is, any off-diagonal correlation involving a column with a missing value is Double.NaN, although all of the on-diagonal elements are 1.0. If that's not the desired behavior, missing values can be removed or imputed first.
    • getCovarianceMatrix

      public Matrix getCovarianceMatrix()

      Specified by:
      getCovarianceMatrix in interface DataSet
      Returns:
      the covariance matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library--that is, any covariance involving a column with a missing value is Double.NaN. If that's not the desired behavior, missing values can be removed or imputed first.
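
      The reason a single missing value makes a covariance Double.NaN can be seen in a small plain-Java sketch. It is not the COLT implementation: CovarianceNaNSketch is a hypothetical class, and both the sample (n - 1) denominator and the use of Double.NaN as the missing-value marker are assumptions made for the example.

```java
// Hypothetical sketch, not COLT or Tetrad code: shows NaN propagating through a covariance.
final class CovarianceNaNSketch {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v; // A single NaN makes the whole sum NaN.
        }
        return sum / x.length;
    }

    // Sample covariance with an (n - 1) denominator (an assumption for this sketch).
    static double covariance(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - mx) * (y[i] - my);
        }
        return sum / (x.length - 1);
    }
}
```

      Removing or imputing the missing entries before the call, as the description suggests, avoids the NaN result.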
    • getInt

      public int getInt(int row, int column)

      Specified by:
      getInt in interface DataSet
      Parameters:
      row - The index of the case.
      column - The index of the variable.
      Returns:
      the value at the given row and column as an int, rounding if necessary. For discrete variables, this returns the category index of the datum for the variable at that column. Returns DiscreteVariable.MISSING_VALUE for missing values.
    • getDouble

      public double getDouble(int row, int column)

      Specified by:
      getDouble in interface DataSet
      Parameters:
      row - The index of the case.
      column - The index of the variable.
      Returns:
      the value at the given row and column as a double. For discrete data, returns the integer value cast to a double.
    • toString

      public String toString()

      Specified by:
      toString in interface DataModel
      Specified by:
      toString in interface DataSet
      Overrides:
      toString in class Object
      Returns:
      a string, suitable for printing, of the dataset. Lines are separated by '\n', tokens in the line by whatever character is set in the setOutputDelimiter() method. The list of variables is printed first, followed by one line for each case. This method should probably not be used for saving to files. If that's your goal, use the DataSavers class instead.
    • getDoubleData

      public Matrix getDoubleData()

      Specified by:
      getDoubleData in interface DataSet
      Returns:
      a copy of the underlying COLT TetradMatrix matrix, containing all of the data in this dataset, discrete data included. Discrete data will be represented by ints cast to doubles. Rows in this matrix are cases, and columns are variables. The list of variables, in the order in which they occur in the matrix, is given by getVariables().

      If isMultipliersCollapsed() returns false, multipliers in the dataset are first expanded before returning the matrix, so the number of rows in the returned matrix may not be the same as the number of rows in this dataset.

      Throws:
      IllegalStateException - if this is not a continuous data set.
    • subsetColumns

      public DataSet subsetColumns(int[] indices)

      Specified by:
      subsetColumns in interface DataSet
      Parameters:
      indices - The indices of the columns to include, in the order in which they should appear in the new data set.
      Returns:
      a new data set in which the column at indices[i] is placed at index i, for i = 0 to indices.length - 1. (Moved over from Purify.)
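
      The index mapping described for subsetColumns(int[] indices) can be sketched in plain Java as follows. SubsetColumnsSketch is a hypothetical class operating on a double[][] table; it is not Tetrad code.

```java
// Hypothetical sketch, not Tetrad code: reorders and selects columns by index.
final class SubsetColumnsSketch {

    static double[][] subsetColumns(double[][] data, int[] indices) {
        double[][] result = new double[data.length][indices.length];
        for (int r = 0; r < data.length; r++) {
            for (int i = 0; i < indices.length; i++) {
                // The column at indices[i] in the source lands at index i in the result.
                result[r][i] = data[r][indices[i]];
            }
        }
        return result;
    }
}
```

      Passing {2, 0} to a three-column table therefore yields a two-column table holding the old column 2 followed by the old column 0.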
    • subsetRows

      public DataSet subsetRows(int[] rows)

      Specified by:
      subsetRows in interface DataSet
      Parameters:
      rows - The indices of the rows to include.
      Returns:
      a new data set consisting of the given rows.
    • subsetRowsColumns

      public DataSet subsetRowsColumns(int[] rows, int[] columns)

      Specified by:
      subsetRowsColumns in interface DataSet
      Parameters:
      rows - The indices of the rows to include.
      columns - The indices of the columns to include.
      Returns:
      a new data set consisting of the given rows and columns.
    • removeCols

      public void removeCols(int[] cols)
      Removes the given columns from the data set.

      Specified by:
      removeCols in interface DataSet
      Parameters:
      cols - The indices of the columns to remove.
    • removeRows

      public void removeRows(int[] selectedRows)
      Removes the given rows from the data set.
      Specified by:
      removeRows in interface DataSet
      Parameters:
      selectedRows - The indices of the rows to remove.
    • equals

      public boolean equals(Object obj)
      Checks if the given object is equal to this dataset.
      Specified by:
      equals in interface DataSet
      Overrides:
      equals in class Object
      Parameters:
      obj - The object to check.
      Returns:
      True if the given object is equal to this dataset.
    • copy

      public DataSet copy()
      Returns a copy of this dataset.
      Specified by:
      copy in interface DataModel
      Specified by:
      copy in interface DataSet
      Returns:
      A copy of this dataset.
    • like

      public DataSet like()
      Returns a dataset with the same dimensions as this dataset, but with no data.
      Specified by:
      like in interface DataSet
      Returns:
      a dataset with the same dimensions as this dataset, but with no data.
    • setOutputDelimiter

      public void setOutputDelimiter(Character character)
      Sets the character ('\t', ' ', ',', for instance) that is used to delimit tokens when the data set is printed out using the toString() method.

      Specified by:
      setOutputDelimiter in interface DataSet
      Parameters:
      character - The character used as a delimiter when the dataset is output
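
      The effect of an output delimiter on printed rows can be pictured with this plain-Java sketch. It is not the Tetrad toString() implementation, whose exact layout is not specified here: DelimiterSketch and format are hypothetical names, and integer-valued cells are assumed to keep the example short.

```java
import java.util.StringJoiner;

// Hypothetical sketch, not Tetrad code: prints a header line and one line per case.
final class DelimiterSketch {

    static String format(String[] names, int[][] rows, char delimiter) {
        String sep = String.valueOf(delimiter);
        StringBuilder sb = new StringBuilder();
        // Variable names come first, separated by the delimiter.
        sb.append(String.join(sep, names)).append('\n');
        for (int[] row : rows) {
            StringJoiner joiner = new StringJoiner(sep);
            for (int value : row) {
                joiner.add(Integer.toString(value));
            }
            // Each case is written on its own line.
            sb.append(joiner.toString()).append('\n');
        }
        return sb.toString();
    }
}
```

      Switching the delimiter from ',' to '\t' changes only the separator between tokens, not the line structure.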
    • permuteRows

      public void permuteRows()
      Randomly permutes the rows of the dataset.
      Specified by:
      permuteRows in interface DataSet
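
      One common way to permute rows uniformly at random is a Fisher-Yates shuffle, sketched below in plain Java. The source does not say which algorithm permuteRows() uses, so this is an assumption for illustration; PermuteRowsSketch is a hypothetical class, and the Random parameter is added here only to make the sketch reproducible.

```java
import java.util.Random;

// Hypothetical sketch, not Tetrad code: Fisher-Yates shuffle over the rows of a table.
final class PermuteRowsSketch {

    static void permuteRows(double[][] data, Random random) {
        for (int i = data.length - 1; i > 0; i--) {
            // Pick a row from the not-yet-fixed prefix [0, i] and swap it into place i.
            int j = random.nextInt(i + 1);
            double[] tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

      Rows are swapped as whole units, so the values within each case stay together and only the order of the cases changes.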
    • getNumberFormat

      public NumberFormat getNumberFormat()

      Specified by:
      getNumberFormat in interface DataSet
      Returns:
      the number format, which by default is the one at NumberFormatUtil.getInstance().getNumberFormat(), but can be set by the user if desired.
    • setNumberFormat

      public void setNumberFormat(NumberFormat nf)
      Sets the number format used to print out continuous values when the data set is printed. The default is the one at NumberFormatUtil.getInstance().getNumberFormat().

      Specified by:
      setNumberFormat in interface DataSet
      Parameters:
      nf - The number formatter used to print out continuous values.
    • getDataBox

      public DataBox getDataBox()

      Getter for the field dataBox.

      Returns:
      the data box that holds the data for this data set.