Class NumberObjectDataSet

java.lang.Object
edu.cmu.tetrad.data.NumberObjectDataSet
All Implemented Interfaces:
DataModel, DataSet, KnowledgeTransferable, VariableSource, TetradSerializable, Serializable

public final class NumberObjectDataSet extends Object implements DataSet
Wraps a 2D array of Number objects in such a way that mixed data sets can be stored. The type of each column must be specified by a Variable object, which must be either a ContinuousVariable or a DiscreteVariable. This class violates object orientation in that the underlying data matrix is retrievable using the getDoubleData() method. This is allowed so that external calculations may be performed on large datasets without having to allocate extra memory. If this matrix needs to be modified externally, please consider making a copy of it first, using the TetradMatrix copy() method.

The data set may be given a name; this name is not used internally.

The data set has a list of variables associated with it, as described above. This list is coordinated with the stored data, in that data for the i'th variable will be in the i'th column.
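
The coordination between the variable list and the columns can be pictured with a small self-contained sketch. It does not use the Tetrad classes: `MixedTableSketch`, its methods, and the use of plain `String` names in place of `Node` objects are hypothetical stand-ins, and treating `null` as the missing-value marker is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a mixed data set; not the Tetrad implementation. */
final class MixedTableSketch {
    private final Number[][] data;        // rows are cases, columns are variables
    private final List<String> variables; // variables.get(i) describes column i

    MixedTableSketch(Number[][] data, List<String> variables) {
        for (Number[] row : data) {
            if (row.length != variables.size()) {
                throw new IllegalArgumentException("Each row needs one value per variable.");
            }
        }
        this.data = data;
        this.variables = new ArrayList<>(variables);
    }

    /** Column index of a variable: its position in the variable list. */
    int columnOf(String variable) {
        return variables.indexOf(variable);
    }

    /** Value for the named variable in the given row; null marks a missing value here. */
    Number valueOf(int row, String variable) {
        return data[row][columnOf(variable)];
    }
}
```

Because the i'th variable always owns the i'th column, a lookup by variable reduces to a list search followed by an array access.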

A subset of variables in the data set may be designated as selected. This selection set is stored with the data set and may be manipulated using the select and deselect methods.

Knowledge may be associated with the data set, using the setKnowledge method. This knowledge is not used internally by the data set, but it may be retrieved by algorithms and used.

This data set replaces an earlier Minitab-style DataSet class. The reasons for replacement are as follows.

  • COLT matrices are optimized for double 2D matrix calculations in ways that Java-style double[][] matrices are not.
  • The COLT library comes with a wide range of linear algebra library methods that are better tested and more flexible than the linear algebra methods used previously in Tetrad.
  • Views of COLT matrices can often be used in places where copies of data sets were being created.
  • The only place where data sets were being manipulated for honest reasons was in the interface; everywhere else, it turns out to have been sensible to calculate a list of variables and a sample size in advance and allocate memory for a data set with these dimensions. For very large data sets, it makes more sense to disallow memory-hogging manipulations than to throw out-of-memory errors.
Author:
josephramsey
  • Constructor Details

    • NumberObjectDataSet

      public NumberObjectDataSet(Number[][] data, List<Node> variables)
  • Method Details

    • serializableInstance

      public static NumberObjectDataSet serializableInstance()
      Generates a simple exemplar of this class to test serialization.
    • getColumnToTooltip

      public Map<String,String> getColumnToTooltip()
      Specified by:
      getColumnToTooltip in interface DataSet
    • getName

      public String getName()
      Gets the name of the data set.
      Specified by:
      getName in interface DataModel
      Specified by:
      getName in interface DataSet
      Returns:
      the name of the data set.
    • setName

      public void setName(String name)
      Sets the name of the data set.
      Specified by:
      setName in interface DataModel
    • getNumColumns

      public int getNumColumns()
      Specified by:
      getNumColumns in interface DataSet
      Returns:
      the number of variables in the data set.
    • getNumRows

      public int getNumRows()
      Specified by:
      getNumRows in interface DataSet
      Returns:
      the number of rows in the rectangular data set, which is the maximum of the number of rows in the list of wrapped columns.
    • setInt

      public void setInt(int row, int column, int value)
      Sets the value at the given (row, column) to the given int value, assuming the variable for the column is discrete.
      Specified by:
      setInt in interface DataSet
      Parameters:
      row - The index of the case.
      column - The index of the variable.
    • setDouble

      public void setDouble(int row, int column, double value)
      Sets the value at the given (row, column) to the given double value, assuming the variable for the column is continuous.
      Specified by:
      setDouble in interface DataSet
      Parameters:
      row - The index of the case.
      column - The index of the variable.
    • getObject

      public Object getObject(int row, int col)
      Specified by:
      getObject in interface DataSet
      Parameters:
      row - The index of the case.
      col - The index of the variable.
      Returns:
      the value at the given row and column as an Object. The type returned is deliberately vague, allowing for variables of any type. Primitives will be returned as corresponding wrapping objects (for example, doubles as Doubles).
    • setObject

      public void setObject(int row, int col, Object value)
      Description copied from interface: DataSet
      Sets the value at the given (row, column) to the given value.
      Specified by:
      setObject in interface DataSet
      Parameters:
      row - The index of the case.
      col - The index of the variable.
    • getSelectedIndices

      public int[] getSelectedIndices()
      Specified by:
      getSelectedIndices in interface DataSet
      Returns:
      the indices of the currently selected variables.
    • getSelectedVariables

      public Set<Node> getSelectedVariables()
      Returns:
      the set of currently selected variables.
    • addVariable

      public void addVariable(Node variable)
      Adds the given variable to the data set as a new last column filled with missing values, increasing the number of columns by one.
      Specified by:
      addVariable in interface DataSet
      Throws:
      IllegalArgumentException - if the variable already exists in the dataset.
    • addVariable

      public void addVariable(int index, Node variable)
      Adds the given variable to the dataset, increasing the number of columns by one, moving columns i >= index to column i + 1, and inserting a column of missing values at column index.
      Specified by:
      addVariable in interface DataSet
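
The column shift described for the indexed addVariable can be sketched on a bare `Number[][]` as follows. `InsertColumnSketch` and `insertMissingColumn` are hypothetical names, the sketch returns a new array rather than mutating a data set, and `null` stands in for a missing value as an assumption.

```java
/** Hypothetical helper; not the Tetrad implementation. */
final class InsertColumnSketch {
    /** Returns a copy of data with a column of nulls (missing values) inserted at index. */
    static Number[][] insertMissingColumn(Number[][] data, int numColumns, int index) {
        if (index < 0 || index > numColumns) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        Number[][] result = new Number[data.length][numColumns + 1];
        for (int r = 0; r < data.length; r++) {
            // Columns before index keep their position.
            System.arraycopy(data[r], 0, result[r], 0, index);
            // result[r][index] is left null, which represents a missing value here.
            // Columns i >= index move to column i + 1.
            System.arraycopy(data[r], index, result[r], index + 1, numColumns - index);
        }
        return result;
    }
}
```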
    • getVariable

      public Node getVariable(int col)
      Specified by:
      getVariable in interface DataSet
      Returns:
      the variable at the given column.
    • getColumn

      public int getColumn(Node variable)
      Specified by:
      getColumn in interface DataSet
      Returns:
      the index of the column of the given variable. You can also get this by calling getVariables().indexOf(variable).
    • changeVariable

      public void changeVariable(Node from, Node to)
      Changes the variable for the given column from the variable 'from' to the variable 'to'. Currently supported only for discrete variables.
      Specified by:
      changeVariable in interface DataSet
      Throws:
      IllegalArgumentException - if the given change is not supported.
    • getVariable

      public Node getVariable(String varName)
      Specified by:
      getVariable in interface DataModel
      Specified by:
      getVariable in interface DataSet
      Returns:
      the variable with the given name.
    • getVariables

      public List<Node> getVariables()
      Description copied from interface: VariableSource
      Returns the list of variables associated with this object.
      Specified by:
      getVariables in interface DataSet
      Specified by:
      getVariables in interface VariableSource
      Returns:
      (a copy of) the List of Variables for the data set, in the order of their columns.
    • getKnowledge

      public Knowledge getKnowledge()
      Specified by:
      getKnowledge in interface KnowledgeTransferable
      Returns:
      a copy of the knowledge associated with this data set. (Cannot be null.)
    • setKnowledge

      public void setKnowledge(Knowledge knowledge)
      Sets knowledge to be associated with this data set. May not be null.
      Specified by:
      setKnowledge in interface KnowledgeTransferable
    • getVariableNames

      public List<String> getVariableNames()
      Description copied from interface: VariableSource
      Returns the variable names associated with this object.
      Specified by:
      getVariableNames in interface DataSet
      Specified by:
      getVariableNames in interface VariableSource
      Returns:
      (a copy of) the List of variable names for the data set, in the order of their columns.
    • setSelected

      public void setSelected(Node variable, boolean selected)
      Marks the given column as selected if 'selected' is true or deselected if 'selected' is false.
      Specified by:
      setSelected in interface DataSet
    • clearSelection

      public void clearSelection()
      Marks all variables as deselected.
      Specified by:
      clearSelection in interface DataSet
    • ensureRows

      public void ensureRows(int rows)
      Ensures that the dataset has at least the given number of rows, adding rows if necessary to make that the case. The new rows will be filled with missing values.
      Specified by:
      ensureRows in interface DataSet
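
A minimal sketch of the row-padding behavior described for ensureRows, written against a bare `Number[][]`. `EnsureRowsSketch` is a hypothetical name, `null` is assumed to mean "missing", and the real class may resize its storage differently.

```java
import java.util.Arrays;

/** Hypothetical helper; not the Tetrad implementation. */
final class EnsureRowsSketch {
    /** Returns data unchanged if it already has enough rows, else a padded copy. */
    static Number[][] ensureRows(Number[][] data, int numColumns, int rows) {
        if (data.length >= rows) {
            return data; // already large enough, nothing to add
        }
        // Copy the existing row references; the new slots start out null.
        Number[][] grown = Arrays.copyOf(data, rows);
        for (int r = data.length; r < rows; r++) {
            grown[r] = new Number[numColumns]; // every cell null, i.e. missing
        }
        return grown;
    }
}
```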
    • ensureColumns

      public void ensureColumns(int columns, List<String> excludedVariableNames)
      Ensures that the dataset has at least the given number of columns, adding continuous variables with unique names until that is true. The new columns will be filled with missing values.
      Specified by:
      ensureColumns in interface DataSet
    • existsMissingValue

      public boolean existsMissingValue()
      Description copied from interface: DataSet
      Returns true if and only if this data set contains at least one missing value.
      Specified by:
      existsMissingValue in interface DataSet
    • isSelected

      public boolean isSelected(Node variable)
      Specified by:
      isSelected in interface DataSet
      Returns:
      true iff the given column has been marked as selected.
    • removeColumn

      public void removeColumn(int index)
      Removes the column for the variable at the given index, reducing the number of columns by one.
      Specified by:
      removeColumn in interface DataSet
    • removeColumn

      public void removeColumn(Node variable)
      Removes the column for the given variable from the dataset, reducing the number of columns by one.
      Specified by:
      removeColumn in interface DataSet
    • subsetColumns

      public DataSet subsetColumns(List<Node> vars)
      Creates and returns a dataset consisting of those variables in the list vars. Vars must be a subset of the variables of this DataSet. The ordering of the elements of vars will be the same as in the list of variables in this DataSet.
      Specified by:
      subsetColumns in interface DataSet
    • isContinuous

      public boolean isContinuous()
      Specified by:
      isContinuous in interface DataModel
      Specified by:
      isContinuous in interface DataSet
      Returns:
      true iff this is a continuous data set--that is, if every column in it is continuous. (By implication, empty datasets are both discrete and continuous.)
    • isDiscrete

      public boolean isDiscrete()
      Specified by:
      isDiscrete in interface DataModel
      Specified by:
      isDiscrete in interface DataSet
      Returns:
      true iff this is a discrete data set--that is, if every column in it is discrete. (By implication, empty datasets are both discrete and continuous.)
    • isMixed

      public boolean isMixed()
      Specified by:
      isMixed in interface DataModel
      Specified by:
      isMixed in interface DataSet
      Returns:
      true if this is a mixed data set--that is, if it contains at least one continuous column and one discrete column.
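
The three type predicates (isContinuous, isDiscrete, isMixed) can be sketched over a list of column types. `DataTypeSketch` and its `ColumnType` enum are hypothetical stand-ins for checking whether each variable is a ContinuousVariable or a DiscreteVariable. The sketch also shows why an empty data set counts as both continuous and discrete but not mixed.

```java
import java.util.List;

/** Hypothetical helper; not the Tetrad implementation. */
final class DataTypeSketch {
    enum ColumnType { CONTINUOUS, DISCRETE }

    /** True iff every column is continuous (vacuously true when there are no columns). */
    static boolean isContinuous(List<ColumnType> columns) {
        return columns.stream().allMatch(c -> c == ColumnType.CONTINUOUS);
    }

    /** True iff every column is discrete (vacuously true when there are no columns). */
    static boolean isDiscrete(List<ColumnType> columns) {
        return columns.stream().allMatch(c -> c == ColumnType.DISCRETE);
    }

    /** True iff there is at least one continuous and at least one discrete column. */
    static boolean isMixed(List<ColumnType> columns) {
        return columns.contains(ColumnType.CONTINUOUS)
                && columns.contains(ColumnType.DISCRETE);
    }
}
```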
    • getCorrelationMatrix

      public Matrix getCorrelationMatrix()
      Description copied from interface: DataSet
      If this is a continuous data set, returns the correlation matrix.
      Specified by:
      getCorrelationMatrix in interface DataSet
      Returns:
      the correlation matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library--that is, any off-diagonal correlation involving a column with a missing value is Double.NaN, although all of the on-diagonal elements are 1.0. If that's not the desired behavior, missing values can be removed or imputed first.
    • getCovarianceMatrix

      public Matrix getCovarianceMatrix()
      Description copied from interface: DataSet
      If this is a continuous data set, returns the covariance matrix.
      Specified by:
      getCovarianceMatrix in interface DataSet
      Returns:
      the covariance matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library--that is, any covariance involving a column with a missing value is Double.NaN. If that's not the desired behavior, missing values can be removed or imputed first.
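
The NaN behavior described above follows from ordinary IEEE floating-point arithmetic, which the following self-contained sketch illustrates. It is not the COLT or Tetrad code: `CovarianceNaNSketch` is a hypothetical name, and the n - 1 denominator is an assumption of the sketch.

```java
/** Hypothetical helper showing NaN propagation; not the COLT or Tetrad code. */
final class CovarianceNaNSketch {
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v; // a single NaN makes the whole sum NaN
        }
        return sum / x.length;
    }

    /** Sample covariance with an n - 1 denominator (an assumption of this sketch). */
    static double covariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need two columns of equal length >= 2.");
        }
        double mx = mean(x);
        double my = mean(y);
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - mx) * (y[i] - my);
        }
        return sum / (x.length - 1);
    }
}
```

If a missing value is stored as Double.NaN, every covariance involving that column comes out as NaN, which is why removing or imputing missing values first is suggested.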
    • getInt

      public int getInt(int row, int column)
      Specified by:
      getInt in interface DataSet
      Returns:
      the value at the given row and column, rounded to the nearest integer, or DiscreteVariable.MISSING_VALUE if the value is missing.
    • getDouble

      public double getDouble(int row, int column)
      Specified by:
      getDouble in interface DataSet
      Returns:
      the double value at the given row and column. For discrete variables, this returns an int cast to a double. The double value at the given row and column may be missing, in which case Double.NaN is returned.
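
The missing-value conventions of getInt and getDouble can be sketched for a single `Number` cell. `CellAccessSketch` is a hypothetical name, `null` is assumed to be the stored missing marker, and the constant -99 is only a placeholder standing in for DiscreteVariable.MISSING_VALUE.

```java
/** Hypothetical helper; not the Tetrad implementation. */
final class CellAccessSketch {
    /** Placeholder sentinel standing in for DiscreteVariable.MISSING_VALUE (assumed value). */
    static final int MISSING_INT = -99;

    /** Returns the cell as a double, or Double.NaN if the cell is missing. */
    static double getDouble(Number cell) {
        return cell == null ? Double.NaN : cell.doubleValue();
    }

    /** Returns the cell rounded to the nearest int, or the sentinel if it is missing. */
    static int getInt(Number cell) {
        if (cell == null || Double.isNaN(cell.doubleValue())) {
            return MISSING_INT;
        }
        return (int) Math.round(cell.doubleValue());
    }
}
```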
    • toString

      public String toString()
      Description copied from interface: DataModel
      Renders the data model as a String.
      Specified by:
      toString in interface DataModel
      Specified by:
      toString in interface DataSet
      Overrides:
      toString in class Object
      Returns:
      a string, suitable for printing, of the dataset. Lines are separated by '\n', tokens in the line by whatever character is set in the setOutputDelimiter() method. The list of variables is printed first, followed by one line for each case. This method should probably not be used for saving to files. If that's your goal, use the DataSavers class instead.
    • getDoubleData

      public Matrix getDoubleData()
      Specified by:
      getDoubleData in interface DataSet
      Returns:
      a copy of the underlying COLT TetradMatrix matrix, containing all of the data in this dataset, discrete data included. Discrete data will be represented by ints cast to doubles. Rows in this matrix are cases, and columns are variables. The list of variables, in the order in which they occur in the matrix, is given by getVariables().
      Throws:
      IllegalStateException - if this is not a continuous data set.
    • subsetColumns

      public DataSet subsetColumns(int[] indices)
      Specified by:
      subsetColumns in interface DataSet
      Returns:
      a new data set in which the column at indices[i] is placed at index i, for i = 0 to indices.length - 1. (Moved over from Purify.)
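
The index mapping used by subsetColumns(int[]) can be sketched on a bare `Number[][]`. `SubsetColumnsSketch` is a hypothetical name, and the sketch copies only the cell values, not the variable list or other data set state.

```java
/** Hypothetical helper; not the Tetrad implementation. */
final class SubsetColumnsSketch {
    /** Returns a new array whose column i holds the source column indices[i]. */
    static Number[][] subsetColumns(Number[][] data, int[] indices) {
        Number[][] result = new Number[data.length][indices.length];
        for (int r = 0; r < data.length; r++) {
            for (int i = 0; i < indices.length; i++) {
                result[r][i] = data[r][indices[i]];
            }
        }
        return result;
    }
}
```

Since the output order follows the order of the index array, the same call can reorder columns as well as drop them.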
    • subsetRows

      public DataSet subsetRows(int[] rows)
      Specified by:
      subsetRows in interface DataSet
      Returns:
      a new data set in which the row at rows[i] is placed at index i, for i = 0 to rows.length - 1. (View instead?)
    • subsetRowsColumns

      public DataSet subsetRowsColumns(int[] rows, int[] columns)
      Specified by:
      subsetRowsColumns in interface DataSet
    • removeCols

      public void removeCols(int[] cols)
      Removes the given columns from the data set.
      Specified by:
      removeCols in interface DataSet
    • removeRows

      public void removeRows(int[] selectedRows)
      Removes the given rows from the data set.
      Specified by:
      removeRows in interface DataSet
    • equals

      public boolean equals(Object obj)
      Specified by:
      equals in interface DataSet
      Overrides:
      equals in class Object
      Returns:
      true iff obj is a continuous RectangularDataSet with corresponding variables of the same name and corresponding data values equal, when rendered using the number format at NumberFormatUtil.getInstance().getNumberFormat().
    • copy

      public DataSet copy()
      Specified by:
      copy in interface DataModel
      Specified by:
      copy in interface DataSet
    • like

      public DataSet like()
      Specified by:
      like in interface DataSet
    • setOutputDelimiter

      public void setOutputDelimiter(Character character)
      Sets the character ('\t', ' ', ',', for instance) that is used to delimit tokens when the data set is printed out using the toString() method.
      Specified by:
      setOutputDelimiter in interface DataSet
    • permuteRows

      public void permuteRows()
      Randomly permutes the rows of the dataset.
      Specified by:
      permuteRows in interface DataSet
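
One common way to permute rows uniformly at random is a Fisher-Yates shuffle, sketched below on a bare `Number[][]`. Whether permuteRows uses this particular technique is not stated in this documentation. `PermuteRowsSketch` is a hypothetical name, and passing in the `Random` source is a choice made for the sketch so that runs can be reproduced.

```java
import java.util.Random;

/** Hypothetical helper; not the Tetrad implementation. */
final class PermuteRowsSketch {
    /** Shuffles the rows of data in place using a Fisher-Yates shuffle. */
    static void permuteRows(Number[][] data, Random random) {
        for (int i = data.length - 1; i > 0; i--) {
            int j = random.nextInt(i + 1); // uniform in [0, i]
            Number[] tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

Swapping whole row references keeps each case intact, so only the order of the cases changes.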
    • getNumberFormat

      public NumberFormat getNumberFormat()
      Description copied from interface: DataSet
      The number format of the dataset.
      Specified by:
      getNumberFormat in interface DataSet
      Returns:
      the number format, which by default is the one at NumberFormatUtil.getInstance().getNumberFormat(), but can be set by the user if desired.
    • setNumberFormat

      public void setNumberFormat(NumberFormat nf)
      Description copied from interface: DataSet
      The number formatter used to print out continuous values.
      Specified by:
      setNumberFormat in interface DataSet