Class BoxDataSet
- All Implemented Interfaces:
DataModel, DataSet, KnowledgeTransferable, VariableSource, TetradSerializable, Serializable
ContinuousVariable or a DiscreteVariable. This class violates object orientation in that the underlying data matrix is retrievable using the getDoubleData() method. This is allowed so that external calculations may be performed on large datasets without having to allocate extra memory. If this matrix needs to be modified externally, please consider making a copy of it first, using the TetradMatrix copy() method.
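The copy-before-modify advice above can be illustrated without Tetrad on the classpath. The following is a minimal sketch in which a plain double[][] stands in for the matrix returned by getDoubleData(). The names DefensiveCopySketch and deepCopy are hypothetical and are not part of the Tetrad API.

```java
import java.util.Arrays;

// Hypothetical stand-in: a plain double[][] plays the role of the shared data matrix.
class DefensiveCopySketch {

    // Returns a deep copy, so edits to the result cannot reach the original rows.
    static double[][] deepCopy(double[][] matrix) {
        double[][] copy = new double[matrix.length][];
        for (int i = 0; i < matrix.length; i++) {
            copy[i] = Arrays.copyOf(matrix[i], matrix[i].length);
        }
        return copy;
    }

    public static void main(String[] args) {
        double[][] shared = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] scratch = deepCopy(shared);
        scratch[0][0] = -1.0;              // modify only the copy
        System.out.println(shared[0][0]);  // the shared matrix is unchanged
    }
}
```

With the real class, the analogous step would be to call the copy() method mentioned above on the retrieved matrix before writing to it.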
The data set may be given a name; this name is not used internally.
The data set has a list of variables associated with it, as described above. This list is coordinated with the stored data, in that data for the i'th variable will be in the i'th column.
A subset of variables in the data set may be designated as selected. This selection set is stored with the data set and may be manipulated using the setSelected and clearSelection methods.
A multiplicity m_i may be associated with each case c_i in the dataset, which is interpreted to mean that c_i occurs m_i times in the dataset.
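To make the multiplicity interpretation concrete, here is a minimal sketch that expands each case c_i into m_i identical rows. It uses plain arrays rather than Tetrad types. MultiplicitySketch and expand are hypothetical names chosen for illustration, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "case c_i occurs m_i times".
class MultiplicitySketch {

    // Expands each case into as many identical rows as its multiplier says.
    static List<double[]> expand(double[][] cases, int[] multipliers) {
        if (cases.length != multipliers.length) {
            throw new IllegalArgumentException("one multiplier is required per case");
        }
        List<double[]> expanded = new ArrayList<>();
        for (int i = 0; i < cases.length; i++) {
            for (int k = 0; k < multipliers[i]; k++) {
                expanded.add(cases[i].clone()); // independent copy of the case
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        double[][] cases = {{1.0, 2.0}, {3.0, 4.0}};
        int[] multipliers = {2, 3};
        System.out.println(expand(cases, multipliers).size()); // 2 + 3 rows
    }
}
```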
Knowledge may be associated with the data set, using the setKnowledge method. This knowledge is not used internally to the data set, but it may be retrieved by algorithms and used.
- Author:
- josephramsey
Constructor Summary
- BoxDataSet(BoxDataSet dataSet): Makes a copy of the given data set.
- BoxDataSet(DataBox dataBox, List<Node> variables): Constructs a new data set with the given number of rows and columns, with all values set to missing.
-
Method Summary
- void addVariable(int index, Node variable): Adds the given variable at the given index.
- void addVariable(Node variable): Adds the given variable to the data set.
- void changeVariable(Node from, Node to): Changes the variable for the given column from 'from' to 'to'.
- void clearSelection(): Marks all variables as deselected.
- copy(): Returns a copy of this dataset.
- void ensureColumns(int columns, List<String> excludedVariableNames): Ensures that the dataset has at least the given number of columns.
- void ensureRows(int rows): Ensures that the dataset has at least the given number of rows.
- boolean equals: Checks if the given object is equal to this dataset.
- boolean existsMissingValue(): existsMissingValue.
- int getColumn: getColumn.
- getColumnToTooltip: Getter for the field columnToTooltip.
- getCorrelationMatrix: getCorrelationMatrix.
- getCovarianceMatrix: getCovarianceMatrix.
- getDataBox: Getter for the field dataBox.
- double getDouble(int row, int column): getDouble.
- getDoubleData: getDoubleData.
- int getInt(int row, int column): getInt.
- getKnowledge: Getter for the field knowledge.
- getName(): Gets the name of the data set.
- getNumberFormat: getNumberFormat.
- int getNumColumns(): getNumColumns.
- int getNumRows(): getNumRows.
- getObject(int row, int col): getObject.
- int[] getSelectedIndices(): getSelectedIndices.
- getVariable(int col): getVariable.
- getVariable(String varName): getVariable.
- getVariableNames: getVariableNames.
- getVariables: Getter for the field variables.
- boolean isContinuous(): isContinuous.
- boolean isDiscrete(): isDiscrete.
- boolean isMixed(): isMixed.
- boolean isSelected(Node variable): isSelected.
- like(): Returns a dataset with the same dimensions as this dataset, but with no data.
- void permuteRows(): Randomly permutes the rows of the dataset.
- void removeCols(int[] cols): Removes the given columns from the data set.
- void removeColumn(int index): Removes the variable (and data) at the given index.
- void removeColumn(Node variable): Removes the given variable, along with all of its data.
- void removeRows(int[] selectedRows): Removes the specified rows from the dataBox and updates the selection, multipliers, and knowledge accordingly.
- static BoxDataSet serializableInstance: Generates a simple exemplar of this class to test serialization.
- void setDouble(int row, int column, double value): Sets the value at the given (row, column) to the given double value, assuming the variable for the column is continuous.
- void setInt(int row, int column, int value): Sets the value at the given (row, column) to the given int value, assuming the variable for the column is discrete.
- void setKnowledge(Knowledge knowledge): Sets knowledge to a copy of the given object.
- void setName: Sets the name of the data model (may be null).
- void setNumberFormat: The number formatter used to print out continuous values.
- void setObject: Sets the value at the given (row, column) to the given value.
- void setOutputDelimiter(Character character): The character used as a delimiter when the dataset is output.
- void setSelected(Node variable, boolean selected): Marks the given column as selected if 'selected' is true or deselected if 'selected' is false.
- subsetColumns(int[] indices): Creates a new DataSet object containing only the specified columns.
- subsetColumns(List<Node> vars): Creates and returns a dataset consisting of those variables in the list vars.
- subsetRows(int[] rows): Creates a subset of rows from the existing DataSet.
- subsetRowsColumns(int[] rows, int[] columns): Generates a subset of the current DataSet by selecting specified rows and columns.
- toString(): toString.
-
Constructor Details
-
BoxDataSet
-
BoxDataSet
Makes a copy of the given data set.
- Parameters:
dataSet - The data set to copy.
-
-
Method Details
-
serializableInstance
Generates a simple exemplar of this class to test serialization.
- Returns:
- A simple exemplar of this class.
-
getColumnToTooltip
-
getName
-
setName
-
getNumColumns
public int getNumColumns()
getNumColumns.
- Specified by:
getNumColumns in interface DataSet
- Returns:
- the number of variables in the data set.
-
getNumRows
public int getNumRows()
getNumRows.
- Specified by:
getNumRows in interface DataSet
- Returns:
- the number of rows in the rectangular data set, which is the maximum of the number of rows in the list of wrapped columns.
-
setInt
public void setInt(int row, int column, int value)
Sets the value at the given (row, column) to the given int value, assuming the variable for the column is discrete.
-
setDouble
public void setDouble(int row, int column, double value)
Sets the value at the given (row, column) to the given double value, assuming the variable for the column is continuous.
-
getObject
getObject.
- Specified by:
getObject in interface DataSet
- Parameters:
row - The index of the case.
col - The index of the variable.
- Returns:
- the value at the given row and column as an Object. The type returned is deliberately vague, allowing for variables of any type. Primitives will be returned as corresponding wrapping objects (for example, doubles as Doubles).
-
setObject
-
getSelectedIndices
public int[] getSelectedIndices()
getSelectedIndices.
- Specified by:
getSelectedIndices in interface DataSet
- Returns:
- the indices of the currently selected variables.
-
addVariable
Adds the given variable to the data set, increasing the number of columns by one, moving columns i >= index to column i + 1, and inserting a column of missing values at column i.
- Specified by:
addVariable in interface DataSet
- Parameters:
variable - The variable to add.
-
addVariable
Adds the given variable to the dataset at the given index, increasing the number of columns by one, moving each column i >= index to column i + 1, and inserting a column of missing values at column index.
- Specified by:
addVariable in interface DataSet
- Parameters:
index - The index at which to add the variable.
variable - The variable to add.
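The column shift described for the indexed addVariable can be sketched on a plain array. This is an illustration only, not Tetrad's implementation. InsertColumnSketch and insertMissingColumn are hypothetical names, and Double.NaN is assumed here as the marker for a missing continuous value.

```java
// Hypothetical illustration of inserting a column of missing values at a given index.
class InsertColumnSketch {

    // Returns a new matrix in which columns i >= index have moved to i + 1
    // and column `index` holds NaN (assumed "missing" marker) in every row.
    static double[][] insertMissingColumn(double[][] data, int index) {
        double[][] result = new double[data.length][];
        for (int r = 0; r < data.length; r++) {
            int cols = data[r].length;
            if (index < 0 || index > cols) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            double[] row = new double[cols + 1];
            System.arraycopy(data[r], 0, row, 0, index);                    // columns before index
            row[index] = Double.NaN;                                        // the new, empty column
            System.arraycopy(data[r], index, row, index + 1, cols - index); // shifted columns
            result[r] = row;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] wider = insertMissingColumn(data, 1);
        System.out.println(java.util.Arrays.toString(wider[0]));
    }
}
```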
-
getVariable
getVariable.
- Specified by:
getVariable in interface DataSet
- Parameters:
col - The index of the variable.
- Returns:
- the variable at the given column.
-
getColumn
-
changeVariable
Changes the variable for the given column from 'from' to 'to'. Supported currently only for discrete variables.
- Specified by:
changeVariable in interface DataSet
- Parameters:
from - The variable to change.
to - The variable to change to.
-
getVariable
getVariable.
- Specified by:
getVariable in interface DataModel
- Specified by:
getVariable in interface DataSet
- Parameters:
varName - a String object
- Returns:
- the variable with the given name, or null if no such variable exists.
-
getVariables
Getter for the field variables.
- Specified by:
getVariables in interface DataSet
- Specified by:
getVariables in interface VariableSource
- Returns:
- (a copy of) the List of Variables for the data set, in the order of their columns.
-
getKnowledge
Getter for the field knowledge.
- Specified by:
getKnowledge in interface KnowledgeTransferable
- Returns:
- a copy of the knowledge associated with this data set. (Cannot be null.)
-
setKnowledge
Sets knowledge to a copy of the given object. Sets knowledge to be associated with this data set. May not be null.
- Specified by:
setKnowledge in interface KnowledgeTransferable
- Parameters:
knowledge - the knowledge to set
-
getVariableNames
getVariableNames.
- Specified by:
getVariableNames in interface DataSet
- Specified by:
getVariableNames in interface VariableSource
- Returns:
- (a copy of) the list of variable names for the data set, in the order of their columns.
-
setSelected
Marks the given column as selected if 'selected' is true or deselected if 'selected' is false.
- Specified by:
setSelected in interface DataSet
- Parameters:
variable - The variable to select or deselect.
selected - True to select the variable, false to deselect it.
-
clearSelection
public void clearSelection()
Marks all variables as deselected.
- Specified by:
clearSelection in interface DataSet
-
ensureRows
public void ensureRows(int rows)
Ensures that the dataset has at least the given number of rows, adding rows if necessary. The new rows will be filled with missing values. Used for pasting data into the dataset.
- Specified by:
ensureRows in interface DataSet
- Parameters:
rows - The number of rows to ensure.
-
ensureColumns
Ensures that the dataset has at least the given number of columns, adding continuous variables with unique names until that is true. The new columns will be filled with missing values. Used for pasting data into the dataset. When creating new columns, names in the excludedVariableNames list may not be used. The purpose of this is to allow these names to be set later by the calling class, without incurring conflicts.
- Specified by:
ensureColumns in interface DataSet
- Parameters:
columns - The number of columns to ensure.
excludedVariableNames - The names of variables that should not be used for new columns.
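The naming rule for new columns (unique names that also avoid the excluded list) can be sketched on a list of names alone. This is a hedged illustration: EnsureColumnsSketch is a hypothetical class, and the "X" + counter naming scheme is an assumption made for the example, not something the documentation above specifies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of generating unique, non-excluded column names.
class EnsureColumnsSketch {

    // Appends generated names until the list has at least `columns` entries,
    // skipping names already in use and names in `excluded`.
    static List<String> ensureColumns(List<String> names, int columns, List<String> excluded) {
        List<String> result = new ArrayList<>(names);
        Set<String> taken = new HashSet<>(names);
        taken.addAll(excluded);
        int counter = 1;
        while (result.size() < columns) {
            String candidate = "X" + counter++; // assumed naming scheme
            if (taken.add(candidate)) {         // add() returns false if the name is taken
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = List.of("X1", "A");
        System.out.println(ensureColumns(names, 4, List.of("X2")));
    }
}
```

Because the excluded names are never generated, the caller remains free to assign them to columns later without a clash.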
-
existsMissingValue
public boolean existsMissingValue()
existsMissingValue.
- Specified by:
existsMissingValue in interface DataSet
- Returns:
- true if and only if this data set contains at least one missing value.
-
isSelected
isSelected.
- Specified by:
isSelected in interface DataSet
- Parameters:
variable - The variable to check.
- Returns:
- true iff the given column has been marked as selected.
-
removeColumn
public void removeColumn(int index)
Removes the column for the variable at the given index, along with its data, reducing the number of columns by one.
- Specified by:
removeColumn in interface DataSet
- Parameters:
index - The index of the variable to remove.
-
removeColumn
Removes the column for the given variable from the dataset, along with all of its data, reducing the number of columns by one.
- Specified by:
removeColumn in interface DataSet
- Parameters:
variable - The variable to remove.
-
subsetColumns
Creates and returns a dataset consisting of those variables in the list vars. Vars must be a subset of the variables of this DataSet. The ordering of the elements of vars will be the same as in the list of variables in this DataSet.
- Specified by:
subsetColumns in interface DataSet
- Parameters:
vars - The variables to include in the new data set.
- Returns:
- a new data set consisting of the variables in the list vars.
-
isContinuous
public boolean isContinuous()
isContinuous.
- Specified by:
isContinuous in interface DataModel
- Specified by:
isContinuous in interface DataSet
- Returns:
- true iff this is a continuous data set, that is, if every column in it is continuous. (By implication, empty datasets are both discrete and continuous.)
-
isDiscrete
public boolean isDiscrete()
isDiscrete.
- Specified by:
isDiscrete in interface DataModel
- Specified by:
isDiscrete in interface DataSet
- Returns:
- true iff this is a discrete data set, that is, if every column in it is discrete. (By implication, empty datasets are both discrete and continuous.)
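The remark that an empty dataset is both discrete and continuous follows from "every column" being vacuously true when there are no columns. A minimal sketch, using a hypothetical Kind enum in place of Tetrad's variable classes:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the "every column" checks; not Tetrad's implementation.
class VacuousTruthSketch {

    enum Kind { CONTINUOUS, DISCRETE }

    // True iff every column is continuous; allMatch returns true on an empty list.
    static boolean isContinuous(List<Kind> columns) {
        return columns.stream().allMatch(k -> k == Kind.CONTINUOUS);
    }

    // True iff every column is discrete; allMatch returns true on an empty list.
    static boolean isDiscrete(List<Kind> columns) {
        return columns.stream().allMatch(k -> k == Kind.DISCRETE);
    }

    public static void main(String[] args) {
        List<Kind> empty = Collections.emptyList();
        System.out.println(isContinuous(empty) && isDiscrete(empty)); // both hold for no columns
    }
}
```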
-
isMixed
-
getCorrelationMatrix
getCorrelationMatrix.
- Specified by:
getCorrelationMatrix in interface DataSet
- Returns:
- the correlation matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library. That is, any off-diagonal correlation involving a column with a missing value is Double.NaN, although all of the on-diagonal elements are 1.0. If that's not the desired behavior, missing values can be removed or imputed first.
-
getCovarianceMatrix
getCovarianceMatrix.
- Specified by:
getCovarianceMatrix in interface DataSet
- Returns:
- the covariance matrix for this dataset. Defers to Statistic.covariance() in the COLT matrix library, so it inherits the handling of missing values from that library. That is, any covariance involving a column with a missing value is Double.NaN. If that's not the desired behavior, missing values can be removed or imputed first.
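Why a single missing value turns a whole covariance into Double.NaN can be seen from ordinary floating-point arithmetic: any sum that includes NaN is NaN. The sketch below is a plain sample covariance written for illustration. It is not the COLT routine, and the n - 1 divisor and the name NaNCovarianceSketch are assumptions of this example.

```java
// Hypothetical illustration of NaN propagation through a covariance computation.
class NaNCovarianceSketch {

    // Sample covariance of two equally long columns (divisor n - 1, assumed here).
    static double covariance(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (n - 1);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] complete = {2.0, 4.0, 6.0};
        double[] withMissing = {2.0, Double.NaN, 6.0};
        System.out.println(covariance(x, complete));
        System.out.println(covariance(x, withMissing)); // NaN: the missing value poisons the sums
    }
}
```

Removing or imputing the missing entries first, as the text suggests, keeps the NaN out of the sums.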
-
getInt
public int getInt(int row, int column)
getInt.
- Specified by:
getInt in interface DataSet
- Parameters:
row - The index of the case.
column - The index of the variable.
- Returns:
- the value at the given row and column as an int, rounding if necessary. For discrete variables, this returns the category index of the datum for the variable at that column. Returns DiscreteVariable.MISSING_VALUE for missing values.
-
getDouble
public double getDouble(int row, int column)
getDouble.
-
toString
toString.
- Specified by:
toString in interface DataModel
- Specified by:
toString in interface DataSet
- Overrides:
toString in class Object
- Returns:
- a string, suitable for printing, of the dataset. Lines are separated by '\n', tokens in the line by whatever character is set in the setOutputDelimiter() method. The list of variables is printed first, followed by one line for each case. This method should probably not be used for saving to files. If that's your goal, use the DataSavers class instead.
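The printed layout described above (a header line of variable names, then one line per case, with tokens joined by the output delimiter) can be sketched as follows. DelimitedOutputSketch and render are hypothetical names. The real toString() also applies the configured number format, which this sketch leaves out by using Double.toString.

```java
import java.util.StringJoiner;

// Hypothetical illustration of the described output layout; not Tetrad's toString().
class DelimitedOutputSketch {

    // Renders the variable names first, then one '\n'-terminated line per case.
    static String render(String[] names, double[][] rows, char delimiter) {
        String sep = String.valueOf(delimiter);
        StringBuilder sb = new StringBuilder();
        StringJoiner header = new StringJoiner(sep);
        for (String name : names) {
            header.add(name);
        }
        sb.append(header).append('\n');
        for (double[] row : rows) {
            StringJoiner line = new StringJoiner(sep);
            for (double value : row) {
                line.add(Double.toString(value)); // no number format applied in this sketch
            }
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"X", "Y"};
        double[][] rows = {{1.0, 2.5}, {3.0, 4.0}};
        System.out.print(render(names, rows, '\t'));
    }
}
```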
-
getDoubleData
getDoubleData.
- Specified by:
getDoubleData in interface DataSet
- Returns:
- a copy of the underlying COLT TetradMatrix matrix, containing all of the data in this dataset, discrete data included. Discrete data will be represented by ints cast to doubles. Rows in this matrix are cases, and columns are variables. The list of variables, in the order in which they occur in the matrix, is given by getVariables(). If isMultipliersCollapsed() returns false, multipliers in the dataset are first expanded before returning the matrix, so the number of rows in the returned matrix may not be the same as the number of rows in this dataset.
- Throws:
IllegalStateException - if this is not a continuous data set.
-
subsetColumns
Creates a new DataSet object containing only the specified columns.
- Specified by:
subsetColumns in interface DataSet
- Parameters:
indices - An array of integers representing the indices of the columns to include in the subset.
- Returns:
- A new DataSet containing only the specified columns.
-
subsetRows
Creates a subset of rows from the existing DataSet.
- Specified by:
subsetRows in interface DataSet
- Parameters:
rows - An array of integers representing the indices of the rows to be included in the subset.
- Returns:
- A new DataSet object containing only the specified rows from the original DataSet.
-
subsetRowsColumns
Generates a subset of the current DataSet by selecting specified rows and columns.
- Specified by:
subsetRowsColumns in interface DataSet
- Parameters:
rows - an array of row indices to include in the subset
columns - an array of column indices to include in the subset
- Returns:
- a new DataSet object containing only the specified rows and columns
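Selecting rows and columns by index arrays amounts to a double lookup. The sketch below shows the idea on a plain double[][]. SubsetSketch is a hypothetical class, and whether the real method preserves the order of the given indices (as this sketch does) is an assumption.

```java
// Hypothetical illustration of subsetting by row and column index arrays.
class SubsetSketch {

    // Builds a new matrix whose (i, j) entry is data[rows[i]][columns[j]].
    static double[][] subsetRowsColumns(double[][] data, int[] rows, int[] columns) {
        double[][] out = new double[rows.length][columns.length];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < columns.length; j++) {
                out[i][j] = data[rows[i]][columns[j]];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double[][] subset = subsetRowsColumns(data, new int[]{2, 0}, new int[]{1});
        System.out.println(subset[0][0] + " " + subset[1][0]);
    }
}
```

The original matrix is left untouched, which matches the description of returning a new DataSet.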
-
removeCols
public void removeCols(int[] cols)
Removes the given columns from the data set.
- Specified by:
removeCols in interface DataSet
- Parameters:
cols - The indices of the columns to remove.
-
removeRows
public void removeRows(int[] selectedRows)
Removes the specified rows from the dataBox and updates the selection, multipliers, and knowledge accordingly.
- Specified by:
removeRows in interface DataSet
- Parameters:
selectedRows - an array of integers representing the indices of the rows to be removed from the dataBox
-
equals
-
copy
-
like
-
setOutputDelimiter
Sets the character ('\t', ' ', ',', for instance) that is used to delimit tokens when the data set is printed out using the toString() method.
- Specified by:
setOutputDelimiter in interface DataSet
- Parameters:
character - The character used as a delimiter when the dataset is output.
-
permuteRows
public void permuteRows()
Randomly permutes the rows of the dataset.
- Specified by:
permuteRows in interface DataSet
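One common way to permute rows uniformly at random is the Fisher-Yates shuffle. The documentation above does not say which algorithm permuteRows() uses, so the following is only a sketch of that technique on a plain array. PermuteRowsSketch is a hypothetical name.

```java
import java.util.Random;

// Hypothetical illustration: Fisher-Yates shuffle of the rows of a matrix, in place.
class PermuteRowsSketch {

    // Swaps each row i (from the last down to the second) with a random row j in [0, i].
    static void permuteRows(double[][] data, Random random) {
        for (int i = data.length - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            double[] tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }

    public static void main(String[] args) {
        double[][] data = {{1.0}, {2.0}, {3.0}, {4.0}};
        permuteRows(data, new Random());
        for (double[] row : data) {
            System.out.println(row[0]);
        }
    }
}
```

Rows are moved as whole units, so each case keeps its values across all columns.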
-
getNumberFormat
getNumberFormat.
- Specified by:
getNumberFormat in interface DataSet
- Returns:
- the number format, which by default is the one at NumberFormatUtil.getInstance().getNumberFormat(), but can be set by the user if desired.
-
setNumberFormat
Sets the number format to be used when printing out the data set. The default is the one at
- Specified by:
setNumberFormat in interface DataSet
- Parameters:
nf - The number formatter used to print out continuous values.
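A formatter suitable for passing as the nf parameter can be built with the standard java.text classes. The sketch below only constructs and exercises such a formatter. The four-decimal pattern, the US locale, and the class name NumberFormatSketch are choices made for this example, not defaults of the library.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical illustration of building a NumberFormat for continuous values.
class NumberFormatSketch {

    // Fixed four decimal places with '.' as the decimal separator.
    static NumberFormat fourDecimals() {
        return new DecimalFormat("0.0000", DecimalFormatSymbols.getInstance(Locale.US));
    }

    public static void main(String[] args) {
        NumberFormat nf = fourDecimals();
        System.out.println(nf.format(3.14159265));
        System.out.println(nf.format(2.0));
    }
}
```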
-
getDataBox
Getter for the field dataBox.
- Returns:
- the data box that holds the data for this data set.
-