edu.cmu.tetrad.data.NumberObjectDataSet

All Implemented Interfaces: DataModel, DataSet, KnowledgeTransferable, VariableSource, TetradSerializable, Serializable

public final class NumberObjectDataSet extends Object implements DataSet

Wraps a 2D array of Number objects in such a way that mixed data sets can be stored. The type of each column must be specified by a Variable object, which must be either a ContinuousVariable or a DiscreteVariable. This class violates object orientation in that the underlying data matrix is retrievable using the getDoubleData() method. This is allowed so that external calculations may be performed on large datasets without having to allocate extra memory. If this matrix needs to be modified externally, please consider making a copy of it first, using the TetradMatrix copy() method.

The data set may be given a name; this name is not used internally.

The data set has a list of variables associated with it, as described above. This list is coordinated with the stored data, in that data for the i'th variable will be in the i'th column.
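As a rough illustration of the layout described above, the sketch below stores mixed data in a Number[][] with one type tag per column, so that column i always belongs to variable i. It does not use Tetrad: the class MixedColumnSketch, the enum Kind, and the use of plain String names in place of Node objects are all hypothetical stand-ins chosen for this example.

```java
import java.util.List;

// Hypothetical stand-in (not the Tetrad class): column i of the data array
// belongs to variable i, and each variable carries a type tag.
final class MixedColumnSketch {
    enum Kind { CONTINUOUS, DISCRETE }

    private final Number[][] data;     // rows are cases, columns are variables
    private final List<String> names;  // names.get(i) labels column i
    private final List<Kind> kinds;    // kinds.get(i) gives the type of column i

    MixedColumnSketch(Number[][] data, List<String> names, List<Kind> kinds) {
        if (names.size() != kinds.size()) {
            throw new IllegalArgumentException("one kind per variable is required");
        }
        for (Number[] row : data) {
            if (row.length != names.size()) {
                throw new IllegalArgumentException("row width must equal the variable count");
            }
        }
        this.data = data;
        this.names = names;
        this.kinds = kinds;
    }

    // Resolves a variable name to its column index.
    int columnOf(String name) {
        int col = names.indexOf(name);
        if (col < 0) {
            throw new IllegalArgumentException("unknown variable: " + name);
        }
        return col;
    }

    Number get(int row, String name) {
        return data[row][columnOf(name)];
    }

    Kind kindOf(String name) {
        return kinds.get(columnOf(name));
    }
}
```

Keeping the names and kinds in lists that share an index with the data columns is what makes the i'th variable line up with the i'th column; any operation that adds or removes a column has to update all three together.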

A subset of variables in the data set may be designated as selected. This selection set is stored with the data set and may be manipulated using the select and deselect methods.
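The selection set can be pictured as a set of variables kept alongside the variable list. The following is a minimal sketch under that assumption, not the Tetrad implementation: SelectionSketch is a hypothetical class, variables are represented by String names rather than Node objects, and selectedIndices is an illustrative counterpart of getSelectedIndices().

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.IntStream;

// Hypothetical sketch of a selection set stored next to the variable list.
final class SelectionSketch {
    private final List<String> variables;
    private final Set<String> selected = new LinkedHashSet<>();

    SelectionSketch(List<String> variables) {
        this.variables = new ArrayList<>(variables);
    }

    // Selects or deselects one variable; unknown variables are rejected.
    void setSelected(String variable, boolean on) {
        if (!variables.contains(variable)) {
            throw new IllegalArgumentException("unknown variable: " + variable);
        }
        if (on) {
            selected.add(variable);
        } else {
            selected.remove(variable);
        }
    }

    boolean isSelected(String variable) {
        return selected.contains(variable);
    }

    void clearSelection() {
        selected.clear();
    }

    // Returns the selected column indices in column order.
    int[] selectedIndices() {
        return IntStream.range(0, variables.size())
                .filter(i -> selected.contains(variables.get(i)))
                .toArray();
    }
}
```

In this sketch the indices come back in column order regardless of the order in which variables were selected; whether the real class makes the same guarantee is not stated on this page.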

Knowledge may be associated with the data set, using the setKnowledge method. This knowledge is not used internally by the data set, but it may be retrieved and used by algorithms.

This data set replaces an earlier Minitab-style DataSet class. The reasons for replacement are as follows.

- COLT matrices are optimized for double 2D matrix calculations in ways that Java-style double[][] matrices are not.
- The COLT library comes with a wide range of linear algebra methods that are better tested and more flexible than the linear algebra methods used previously in Tetrad.
- Views of COLT matrices can often be used in places where copies of data sets were being created.
- The only place where data sets were being manipulated for honest reasons was in the interface; everywhere else, it turns out to have been sensible to calculate a list of variables and a sample size in advance and allocate memory for a data set with these dimensions. For very large data sets, it makes more sense to disallow memory-hogging manipulations than to throw out-of-memory errors.

Author:

josephramsey

Constructor Summary

Constructors

Constructor

Description

NumberObjectDataSet(Number[][] data, List<Node> variables)

Constructs a data set from a 2D array of Number objects and a list of variables.

Method Summary

Modifier and Type

Method

Description

void

addVariable(int index, Node variable)

Adds the given variable at the given index.

void

addVariable(Node variable)

Adds the given variable to the data set.

void

changeVariable(Node from, Node to)

Changes the variable for the given column from from to to.

void

clearSelection()

Marks all variables as deselected.

DataSet

copy()

Returns a copy of this dataset.

void

ensureColumns(int columns, List<String> excludedVariableNames)

Ensures that the dataset has at least columns columns.

void

ensureRows(int rows)

Ensures that the dataset has at least rows rows.

boolean

equals(Object obj)

Checks if the given object is equal to this dataset.

boolean

existsMissingValue()

Returns true if and only if this data set contains at least one missing value.

int

getColumn(Node variable)

Returns the column index of the given variable.

Map<String,String>

getColumnToTooltip()

Getter for the field columnToTooltip.

Matrix

getCorrelationMatrix()

Returns the correlation matrix for this dataset.

Matrix

getCovarianceMatrix()

Returns the covariance matrix for this dataset.

double

getDouble(int row, int column)

Returns the value at the given row and column as a double.

Matrix

getDoubleData()

Returns a matrix containing all of the data in this dataset, with discrete values cast to doubles.

int

getInt(int row, int column)

Returns the value at the given row and column as an int, rounding if necessary.

Knowledge

getKnowledge()

Getter for the field knowledge.

String

getName()

Gets the name of the data set.

NumberFormat

getNumberFormat()

Returns the number format used to print out continuous values.

int

getNumColumns()

Returns the number of variables (columns) in the data set.

int

getNumRows()

Returns the number of rows in the data set.

Object

getObject(int row, int col)

Returns the value at the given row and column as an Object.

int[]

getSelectedIndices()

Returns the indices of the currently selected variables.

Set<Node>

getSelectedVariables()

Returns the set of currently selected variables.

Node

getVariable(int col)

Returns the variable at the given column.

Node

getVariable(String varName)

Returns the variable with the given name, or null if no such variable exists.

List<String>

getVariableNames()

Returns the names of the variables, in the order of their columns.

List<Node>

getVariables()

Getter for the field variables.

boolean

isContinuous()

Returns true iff every column in the data set is continuous.

boolean

isDiscrete()

Returns true iff every column in the data set is discrete.

boolean

isMixed()

Returns true if the data set contains at least one continuous column and one discrete column.

boolean

isSelected(Node variable)

Returns true iff the given variable has been marked as selected.

DataSet

like()

Returns a dataset with the same dimensions as this dataset, but with no data.

void

permuteRows()

Randomly permutes the rows of the dataset.

void

removeCols(int[] cols)

Removes the given columns from the data set.

void

removeColumn(int index)

Removes the variable (and data) at the given index.

void

removeColumn(Node variable)

Removes the given variable, along with all of its data.

void

removeRows(int[] selectedRows)

Removes the given rows from the data set.

static NumberObjectDataSet

serializableInstance()

Generates a simple exemplar of this class to test serialization.

void

setDouble(int row, int column, double value)

Sets the value at the given (row, column) to the given double value, assuming the variable for the column is continuous.

void

setInt(int row, int column, int value)

Sets the value at the given (row, column) to the given int value, assuming the variable for the column is discrete.

void

setKnowledge(Knowledge knowledge)

Sets knowledge to a copy of the given object.

void

setName(String name)

Sets the name of the data model (may be null).

void

setNumberFormat(NumberFormat nf)

Sets the number formatter used to print out continuous values.

void

setObject(int row, int col, Object value)

Sets the value at the given (row, column) to the given value.

void

setOutputDelimiter(Character character)

Sets the character used as a delimiter when the dataset is output.

void

setSelected(Node variable, boolean selected)

Marks the given column as selected if 'selected' is true or deselected if 'selected' is false.

DataSet

subsetColumns(int[] indices)

Returns a new data set in which the column at indices[i] is placed at index i.

DataSet

subsetColumns(List<Node> vars)

Creates and returns a dataset consisting of those variables in the list vars.

DataSet

subsetRows(int[] rows)

Returns a new data set containing only the given rows.

DataSet

subsetRowsColumns(int[] rows, int[] columns)

Generates a subset of the current DataSet by selecting specified rows and columns.

String

toString()

Returns a printable string representation of the dataset.

Methods inherited from class java.lang.Object
getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Details
- NumberObjectDataSet
  
  public NumberObjectDataSet(Number[][] data, List<Node> variables)
  
  Constructs a data set from a 2D array of Number objects and a list of variables.
  
  Parameters:
  
  data - the 2D array of Number objects to wrap
  
  variables - the list of variables, one for each column of data
Method Details
- serializableInstance
  
  public static NumberObjectDataSet serializableInstance()
  
  Generates a simple exemplar of this class to test serialization.
  
  Returns:
  
  a NumberObjectDataSet object
- getColumnToTooltip
  
  public Map<String,String> getColumnToTooltip()
  
  Getter for the field columnToTooltip.
  
  Specified by:
  
  getColumnToTooltip in interface DataSet
  
  Returns:
  
  a Map object
- getName
  
  public String getName()
  
  Gets the name of the data set.
  
  Specified by:
  
  getName in interface DataModel
  
  Specified by:
  
  getName in interface DataSet
  
  Returns:
  
  a String object
- setName
  
  public void setName(String name)
  
  Sets the name of the data set (may be null).
  
  Specified by:
  
  setName in interface DataModel
  
  Parameters:
  
  name - the name to set
- getNumColumns
  
  public int getNumColumns()
  
  Returns the number of variables (columns) in the data set.
  
  Specified by:
  
  getNumColumns in interface DataSet
  
  Returns:
  
  the number of variables in the data set.
- getNumRows
  
  public int getNumRows()
  
  Returns the number of rows in the data set.
  
  Specified by:
  
  getNumRows in interface DataSet
  
  Returns:
  
  the number of rows in the rectangular data set, which is the maximum of the number of rows in the list of wrapped columns.
- setInt
  
  public void setInt(int row, int column, int value)
  
  Sets the value at the given (row, column) to the given int value, assuming the variable for the column is discrete.
  
  Specified by:
  
  setInt in interface DataSet
  
  Parameters:
  
  row - The index of the case.
  
  column - The index of the variable.
  
  value - The value to set.
- setDouble
  
  public void setDouble(int row, int column, double value)
  
  Sets the value at the given (row, column) to the given double value, assuming the variable for the column is continuous.
  
  Specified by:
  
  setDouble in interface DataSet
  
  Parameters:
  
  row - The index of the case.
  
  column - The index of the variable.
  
  value - The value to set.
- getObject
  
  public Object getObject(int row, int col)
  
  Returns the value at the given row and column as an Object.
  
  Specified by:
  
  getObject in interface DataSet
  
  Parameters:
  
  row - The index of the case.
  
  col - The index of the variable.
  
  Returns:
  
  the value at the given row and column as an Object. The type returned is deliberately vague, allowing for variables of any type. Primitives will be returned as corresponding wrapping objects (for example, doubles as Doubles).
- setObject
  
  public void setObject(int row, int col, Object value)
  
  Sets the value at the given (row, column) to the given value.
  
  Specified by:
  
  setObject in interface DataSet
  
  Parameters:
  
  row - The index of the case.
  
  col - The index of the variable.
  
  value - The value to set.
- getSelectedIndices
  
  public int[] getSelectedIndices()
  
  Returns the indices of the currently selected variables.
  
  Specified by:
  
  getSelectedIndices in interface DataSet
  
  Returns:
  
  the indices of the currently selected variables.
- getSelectedVariables
  
  public Set<Node> getSelectedVariables()
  
  Returns the set of currently selected variables.
  
  Returns:
  
  the set of currently selected variables.
- addVariable
  
  public void addVariable(Node variable)
  
  Adds the given variable to the data set, increasing the number of columns by one. The new column is filled with missing values.
  
  Specified by:
  
  addVariable in interface DataSet
  
  Parameters:
  
  variable - The variable to add.
- addVariable
  
  public void addVariable(int index, Node variable)
  
  Adds the given variable at the given index, increasing the number of columns by one, moving each column i >= index to column i + 1, and inserting a column of missing values at column index.
  
  Specified by:
  
  addVariable in interface DataSet
  
  Parameters:
  
  index - The index at which to add the variable.
  
  variable - The variable to add.
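The column shift described for addVariable(int, Node) can be sketched on a bare Number[][] as follows. This is an illustration only: InsertColumnSketch and insertColumn are hypothetical names, the sketch returns a new array rather than resizing in place, and the missing-value marker is passed in by the caller because the appropriate marker depends on the column type.

```java
// Hypothetical sketch of inserting a column of missing values at an index.
final class InsertColumnSketch {
    // Returns a new array with one extra column at `index`. Old columns
    // i >= index move to i + 1, and every new cell holds `missing`.
    static Number[][] insertColumn(Number[][] data, int index, Number missing) {
        Number[][] out = new Number[data.length][];
        for (int r = 0; r < data.length; r++) {
            int cols = data[r].length;
            if (index < 0 || index > cols) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            out[r] = new Number[cols + 1];
            // Copy the columns before the insertion point unchanged.
            System.arraycopy(data[r], 0, out[r], 0, index);
            out[r][index] = missing;
            // Shift the remaining columns one position to the right.
            System.arraycopy(data[r], index, out[r], index + 1, cols - index);
        }
        return out;
    }
}
```

Passing an index equal to the current column count appends the new column at the end, which is the behavior one would expect from the single-argument form.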
- getVariable
  
  public Node getVariable(int col)
  
  Returns the variable at the given column.
  
  Specified by:
  
  getVariable in interface DataSet
  
  Parameters:
  
  col - The index of the variable.
  
  Returns:
  
  the variable at the given column.
- getColumn
  
  public int getColumn(Node variable)
  
  Returns the column index of the given variable.
  
  Specified by:
  
  getColumn in interface DataSet
  
  Parameters:
  
  variable - The variable to check.
  
  Returns:
  
  the column index of the given variable.
- changeVariable
  
  public void changeVariable(Node from, Node to)
  
  Changes the variable for the given column from from to to. Supported currently only for discrete variables.
  
  Specified by:
  
  changeVariable in interface DataSet
  
  Parameters:
  
  from - The variable to change.
  
  to - The variable to change to.
- getVariable
  
  public Node getVariable(String varName)
  
  Returns the variable with the given name.
  
  Specified by:
  
  getVariable in interface DataModel
  
  Specified by:
  
  getVariable in interface DataSet
  
  Parameters:
  
  varName - the name of the variable to look up
  
  Returns:
  
  the variable with the given name, or null if no such variable exists.
- getVariables
  
  public List<Node> getVariables()
  
  Getter for the field variables.
  
  Specified by:
  
  getVariables in interface DataSet
  
  Specified by:
  
  getVariables in interface VariableSource
  
  Returns:
  
  (a copy of) the List of Variables for the data set, in the order of their columns.
- getKnowledge
  
  public Knowledge getKnowledge()
  
  Getter for the field knowledge.
  
  Specified by:
  
  getKnowledge in interface KnowledgeTransferable
  
  Returns:
  
  a copy of the knowledge associated with this data set. (Cannot be null.)
- setKnowledge
  
  public void setKnowledge(Knowledge knowledge)
  
  Sets the knowledge associated with this data set to a copy of the given object. May not be null.
  
  Specified by:
  
  setKnowledge in interface KnowledgeTransferable
  
  Parameters:
  
  knowledge - the knowledge to set
- getVariableNames
  
  public List<String> getVariableNames()
  
  Returns the names of the variables, in the order of their columns.
  
  Specified by:
  
  getVariableNames in interface DataSet
  
  Specified by:
  
  getVariableNames in interface VariableSource
  
  Returns:
  
  (a copy of) the list of variable names for the data set, in the order of their columns.
- setSelected
  
  public void setSelected(Node variable, boolean selected)
  
  Marks the given column as selected if 'selected' is true or deselected if 'selected' is false.
  
  Specified by:
  
  setSelected in interface DataSet
  
  Parameters:
  
  variable - The variable to select or deselect.
  
  selected - True to select the variable, false to deselect it.
- clearSelection
  
  public void clearSelection()
  
  Marks all variables as deselected.
  
  Specified by:
  
  clearSelection in interface DataSet
- ensureRows
  
  public void ensureRows(int rows)
  
  Ensures that the dataset has at least the given number of rows, adding rows if necessary. The new rows are filled with missing values. Used for pasting data into the dataset.
  
  Specified by:
  
  ensureRows in interface DataSet
  
  Parameters:
  
  rows - The number of rows to ensure.
- ensureColumns
  
  public void ensureColumns(int columns, List<String> excludedVariableNames)
  
  Ensures that the dataset has at least the given number of columns, adding continuous variables with unique names until that is true. The new columns are filled with missing values. Used for pasting data into the dataset. When creating new columns, names in the excludedVariableNames list may not be used. The purpose of this is to allow these names to be set later by the calling class, without incurring conflicts.
  
  Specified by:
  
  ensureColumns in interface DataSet
  
  Parameters:
  
  columns - The number of columns to ensure.
  
  excludedVariableNames - The names of variables that should not be used for new columns.
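The naming constraint in ensureColumns can be sketched as a loop that keeps proposing candidate names until enough unused ones are found. The sketch below is an assumption-laden illustration, not Tetrad's code: UniqueNameSketch and ensureNames are hypothetical, and the "X1, X2, ..." naming scheme is invented for the example, since this page does not say how new names are generated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of choosing unique names for newly added columns.
final class UniqueNameSketch {
    // Appends names until the list has at least `columns` entries, skipping
    // any candidate that is already present or listed in `excluded`.
    static List<String> ensureNames(List<String> existing, int columns, List<String> excluded) {
        List<String> names = new ArrayList<>(existing);
        int counter = 1;
        while (names.size() < columns) {
            // "X" + counter is an assumed naming scheme for this sketch only.
            String candidate = "X" + counter++;
            if (!names.contains(candidate) && !excluded.contains(candidate)) {
                names.add(candidate);
            }
        }
        return names;
    }
}
```

For example, starting from the single name X1, asking for three columns, and excluding X2 yields X1, X3, X4 in this sketch, which leaves X2 free for the caller to assign later.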
- existsMissingValue
  
  public boolean existsMissingValue()
  
  Checks whether this data set contains any missing values.
  
  Specified by:
  
  existsMissingValue in interface DataSet
  
  Returns:
  
  true if and only if this data set contains at least one missing value.
- isSelected
  
  public boolean isSelected(Node variable)
  
  Checks whether the given variable is selected.
  
  Specified by:
  
  isSelected in interface DataSet
  
  Parameters:
  
  variable - The variable to check.
  
  Returns:
  
  true iff the given column has been marked as selected.
- removeColumn
  
  public void removeColumn(int index)
  
  Removes the variable (and its data) at the given index, reducing the number of columns by one.
  
  Specified by:
  
  removeColumn in interface DataSet
  
  Parameters:
  
  index - The index of the variable to remove.
- removeColumn
  
  public void removeColumn(Node variable)
  
  Removes the given variable, along with all of its data, reducing the number of columns by one.
  
  Specified by:
  
  removeColumn in interface DataSet
  
  Parameters:
  
  variable - The variable to remove.
- subsetColumns
  
  public DataSet subsetColumns(List<Node> vars)
  
  Creates and returns a dataset consisting of those variables in the list vars. Vars must be a subset of the variables of this DataSet. The ordering of the elements of vars will be the same as in the list of variables in this DataSet.
  
  Specified by:
  
  subsetColumns in interface DataSet
  
  Parameters:
  
  vars - The variables to include in the new data set.
  
  Returns:
  
  a new data set consisting of the variables in the list vars.
- isContinuous
  
  public boolean isContinuous()
  
  Checks whether this is a continuous data set.
  
  Specified by:
  
  isContinuous in interface DataModel
  
  Specified by:
  
  isContinuous in interface DataSet
  
  Returns:
  
  true iff this is a continuous data set--that is, if every column in it is continuous. (By implication, empty datasets are both discrete and continuous.)
- isDiscrete
  
  public boolean isDiscrete()
  
  Checks whether this is a discrete data set.
  
  Specified by:
  
  isDiscrete in interface DataModel
  
  Specified by:
  
  isDiscrete in interface DataSet
  
  Returns:
  
  true iff this is a discrete data set--that is, if every column in it is discrete. (By implication, empty datasets are both discrete and continuous.)
- isMixed
  
  public boolean isMixed()
  
  Checks whether this is a mixed data set.
  
  Specified by:
  
  isMixed in interface DataModel
  
  Specified by:
  
  isMixed in interface DataSet
  
  Returns:
  
  true if this is a mixed data set--that is, if it contains at least one continuous column and one discrete column.
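The three type predicates, including the note that an empty dataset counts as both discrete and continuous, can be expressed over a list of column types. This is a sketch under the assumption that each column is tagged with one of two kinds; KindPredicateSketch and Kind are hypothetical names, not part of Tetrad.

```java
import java.util.List;

// Hypothetical sketch of the isContinuous / isDiscrete / isMixed predicates.
final class KindPredicateSketch {
    enum Kind { CONTINUOUS, DISCRETE }

    // allMatch returns true for an empty list, so an empty dataset is
    // reported as continuous (and, below, as discrete too).
    static boolean isContinuous(List<Kind> columns) {
        return columns.stream().allMatch(k -> k == Kind.CONTINUOUS);
    }

    static boolean isDiscrete(List<Kind> columns) {
        return columns.stream().allMatch(k -> k == Kind.DISCRETE);
    }

    // Mixed requires at least one column of each kind.
    static boolean isMixed(List<Kind> columns) {
        return columns.contains(Kind.CONTINUOUS) && columns.contains(Kind.DISCRETE);
    }
}
```

With only two kinds, a non-empty dataset satisfies exactly one of the three predicates, while an empty one satisfies the first two and not the third.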
- getCorrelationMatrix
  
  public Matrix getCorrelationMatrix()
  
  Computes the correlation matrix for this dataset.
  
  Specified by:
  
  getCorrelationMatrix in interface DataSet
  
  Returns:
  
  the correlation matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library--that is, any off-diagonal correlation involving a column with a missing value is Double.NaN, although all of the on-diagonal elements are 1.0. If that's not the desired behavior, missing values can be removed or imputed first.
- getCovarianceMatrix
  
  public Matrix getCovarianceMatrix()
  
  Computes the covariance matrix for this dataset.
  
  Specified by:
  
  getCovarianceMatrix in interface DataSet
  
  Returns:
  
  the covariance matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library--that is, any covariance involving a column with a missing value is Double.NaN. If that's not the desired behavior, missing values can be removed or imputed first.
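To see why a single missing value makes a covariance entry Double.NaN, consider a plain covariance computation in which missing values are stored as NaN. The sketch below is not the COLT routine: CovarianceSketch is a hypothetical class, and the n - 1 denominator is an assumption of this example rather than a statement about what the library uses.

```java
// Hypothetical sketch showing how NaN propagates through a covariance.
final class CovarianceSketch {
    // Sample covariance of two equal-length columns (n - 1 denominator,
    // chosen for this sketch only).
    static double covariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two columns of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            // Any NaN in x or y makes the mean, and therefore the sum, NaN.
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (n - 1);
    }
}
```

Because arithmetic on NaN always yields NaN, one missing cell is enough to contaminate every entry that involves its column, which is why removing or imputing missing values first is suggested.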
- getInt
  
  public int getInt(int row, int column)
  
  Returns the value at the given row and column as an int.
  
  Specified by:
  
  getInt in interface DataSet
  
  Parameters:
  
  row - The index of the case.
  
  column - The index of the variable.
  
  Returns:
  
  the value at the given row and column as an int, rounding if necessary. For discrete variables, this returns the category index of the datum for the variable at that column. Returns DiscreteVariable.MISSING_VALUE for missing values.
- getDouble
  
  public double getDouble(int row, int column)
  
  Returns the value at the given row and column as a double.
  
  Specified by:
  
  getDouble in interface DataSet
  
  Parameters:
  
  row - The index of the case.
  
  column - The index of the variable.
  
  Returns:
  
  the value at the given row and column as a double. For discrete data, returns the integer value cast to a double.
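The conversions described for getInt and getDouble can be sketched as two small helpers over a stored Number. Everything here is illustrative: NumberAccessSketch is a hypothetical class, MISSING_INT is a placeholder marker chosen for the example (consult DiscreteVariable.MISSING_VALUE for the real constant), and treating null or NaN as missing is an assumption of the sketch.

```java
// Hypothetical sketch of reading a stored Number as an int or a double.
final class NumberAccessSketch {
    // Placeholder missing-value marker; an assumption of this sketch.
    static final int MISSING_INT = -99;

    // Rounds to the nearest int, mapping null and NaN to the missing marker.
    static int asInt(Number value) {
        if (value == null || Double.isNaN(value.doubleValue())) {
            return MISSING_INT;
        }
        return (int) Math.round(value.doubleValue());
    }

    // Widens to double; a null cell is reported as NaN.
    static double asDouble(Number value) {
        return value == null ? Double.NaN : value.doubleValue();
    }
}
```

The explicit NaN check matters because Math.round(Double.NaN) returns 0, which would otherwise be indistinguishable from a genuine category index of 0.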
- toString
  public String toString()
  
  Returns a printable string representation of the dataset.
  
  Specified by:
  
  toString in interface DataModel
  
  Specified by:
  
  toString in interface DataSet
  
  Overrides:
  
  toString in class Object
  
  Returns:
  
  a string, suitable for printing, of the dataset. Lines are separated by '\n', tokens in the line by whatever character is set in the setOutputDelimiter() method. The list of variables is printed first, followed by one line for each case. This method should probably not be used for saving to files. If that's your goal, use the DataSavers class instead.
  
  See Also:
  
  setOutputDelimiter(Character)
  
  DataWriter
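The output layout described for toString(), a header line of variable names followed by one delimited line per case, can be sketched as below. DelimitedPrintSketch and render are hypothetical names, cells are printed with String.valueOf rather than a NumberFormat, and the trailing newline after the last case is a choice made for this example.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch of printing a dataset with a configurable delimiter.
final class DelimitedPrintSketch {
    static String render(List<String> names, Number[][] data, char delimiter) {
        String sep = String.valueOf(delimiter);
        StringBuilder sb = new StringBuilder();
        // The list of variables is printed first.
        sb.append(String.join(sep, names)).append('\n');
        // Then one line per case, tokens separated by the delimiter.
        for (Number[] row : data) {
            StringJoiner line = new StringJoiner(sep);
            for (Number cell : row) {
                line.add(String.valueOf(cell));
            }
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

A sketch like this is fine for inspection on the console, but it does no quoting or escaping, which is one reason a dedicated writer class is the better choice for saving to files.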
- getDoubleData
  public Matrix getDoubleData()
  
  Returns the data in this dataset as a matrix of doubles.
  
  Specified by:
  
  getDoubleData in interface DataSet
  
  Returns:
  
  a copy of the underlying COLT TetradMatrix matrix, containing all of the data in this dataset, discrete data included. Discrete data will be represented by ints cast to doubles. Rows in this matrix are cases, and columns are variables. The list of variables, in the order in which they occur in the matrix, is given by getVariables().
  
  Throws:
  
  IllegalStateException - if this is not a continuous data set.
  
- subsetColumns
  
  public DataSet subsetColumns(int[] indices)
  
  Creates and returns a dataset consisting of the columns at the given indices.
  
  Specified by:
  
  subsetColumns in interface DataSet
  
  Parameters:
  
  indices - an array of column indices to include, in the desired order
  
  Returns:
  
  a new data set in which the column at indices[i] is placed at index i, for i = 0 to indices.length - 1. (Moved over from Purify.)
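The index mapping for subsetColumns(int[]) is easy to get backwards, so here is a minimal sketch on a bare Number[][]. SubsetColumnsSketch is a hypothetical helper that handles only the data array; a real implementation would also carry the matching variables across.

```java
// Hypothetical sketch of subsetting columns by index.
final class SubsetColumnsSketch {
    // Column indices[i] of the input becomes column i of the output.
    static Number[][] subsetColumns(Number[][] data, int[] indices) {
        Number[][] out = new Number[data.length][indices.length];
        for (int r = 0; r < data.length; r++) {
            for (int i = 0; i < indices.length; i++) {
                out[r][i] = data[r][indices[i]];
            }
        }
        return out;
    }
}
```

Under this mapping the indices array controls the output order, so passing {2, 0} both drops column 1 and moves the old column 2 in front of the old column 0.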
- subsetRows
  
  public DataSet subsetRows(int[] rows)
  
  Creates and returns a dataset consisting of the given rows.
  
  Specified by:
  
  subsetRows in interface DataSet
  
  Parameters:
  
  rows - an array of row indices to include in the subset
  
  Returns:
  
  a new DataSet object containing only the specified rows
- subsetRowsColumns
  
  public DataSet subsetRowsColumns(int[] rows, int[] columns)
  
  Generates a subset of the current DataSet by selecting specified rows and columns.
  
  Specified by:
  
  subsetRowsColumns in interface DataSet
  
  Parameters:
  
  rows - an array of row indices to include in the subset
  
  columns - an array of column indices to include in the subset
  
  Returns:
  
  a new DataSet object containing only the specified rows and columns
- removeCols
  
  public void removeCols(int[] cols)
  
  Removes the given columns from the data set.
  
  Specified by:
  
  removeCols in interface DataSet
  
  Parameters:
  
  cols - an array of indices of the columns to remove
- removeRows
  
  public void removeRows(int[] selectedRows)
  
  Removes the given rows from the data set.
  
  Specified by:
  
  removeRows in interface DataSet
  
  Parameters:
  
  selectedRows - an array of indices of the rows to remove
- equals
  
  public boolean equals(Object obj)
  
  Checks if the given object is equal to this dataset.
  
  Specified by:
  
  equals in interface DataSet
  
  Overrides:
  
  equals in class Object
  
  Parameters:
  
  obj - The object to check.
  
  Returns:
  
  True if the given object is equal to this dataset.
- copy
  
  public DataSet copy()
  
  Returns a copy of this dataset.
  
  Specified by:
  
  copy in interface DataModel
  
  Specified by:
  
  copy in interface DataSet
  
  Returns:
  
  A copy of this dataset.
- like
  
  public DataSet like()
  
  Returns a dataset with the same dimensions as this dataset, but with no data.
  
  Specified by:
  
  like in interface DataSet
  
  Returns:
  
  a dataset with the same dimensions as this dataset, but with no data.
- setOutputDelimiter
  public void setOutputDelimiter(Character character)
  
  Sets the character ('\t', ' ', ',', for instance) that is used to delimit tokens when the data set is printed out using the toString() method.
  
  Specified by:
  
  setOutputDelimiter in interface DataSet
  
  Parameters:
  
  character - The character used as a delimiter when the dataset is output
  
  See Also:
  
  toString()
- permuteRows
  
  public void permuteRows()
  
  Randomly permutes the rows of the dataset.
  
  Specified by:
  
  permuteRows in interface DataSet
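One common way to permute rows uniformly at random is a Fisher-Yates shuffle over the row references. Whether permuteRows() uses this exact technique is not stated on this page; the sketch below simply shows the idea, with PermuteRowsSketch as a hypothetical helper and the Random passed in so that results can be reproduced with a seed.

```java
import java.util.Random;

// Hypothetical sketch of an in-place random row permutation.
final class PermuteRowsSketch {
    // Fisher-Yates shuffle: each row is swapped with a uniformly chosen
    // row at or before its position, working from the end to the start.
    static void permuteRows(Number[][] data, Random random) {
        for (int i = data.length - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            Number[] tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

Swapping whole rows keeps each case intact, so the values of different variables for the same case stay together after the shuffle.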
- getNumberFormat
  public NumberFormat getNumberFormat()
  
  Returns the number format used to print out continuous values.
  
  Specified by:
  
  getNumberFormat in interface DataSet
  
  Returns:
  
  the number format, which by default is the one at NumberFormatUtil.getInstance().getNumberFormat(), but can be set by the user if desired.
  
  See Also:
  
  setNumberFormat(java.text.NumberFormat)
- setNumberFormat
  
  public void setNumberFormat(NumberFormat nf)
  
  Sets the number formatter used to print out continuous values.
  
  Specified by:
  
  setNumberFormat in interface DataSet
  
  Parameters:
  
  nf - The number formatter used to print out continuous values.
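A formatter suitable for passing to setNumberFormat(NumberFormat) can be built with the standard java.text classes. The sketch below uses only the JDK; NumberFormatSketch and its method names are hypothetical, and the four-decimal pattern is just an example choice.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical sketch of building a NumberFormat for continuous values.
final class NumberFormatSketch {
    // Fixed four decimal places, with locale-independent symbols so the
    // decimal separator is always '.'.
    static NumberFormat fourDecimals() {
        return new DecimalFormat("0.0000", DecimalFormatSymbols.getInstance(Locale.ROOT));
    }

    static String format(double value, NumberFormat nf) {
        return nf.format(value);
    }
}
```

Pinning the symbols to Locale.ROOT avoids output that changes with the machine's default locale, which matters if the printed data set is later parsed by another tool.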
