Package edu.cmu.tetrad.data
Class DataTransforms
java.lang.Object
edu.cmu.tetrad.data.DataTransforms
DataTransforms class.
- Version:
- $Id: $Id
- Author:
- josephramsey
-
Method Summary
Modifier and Type | Method | Description
static DataSet | addMissingData(DataSet inData, double[] probs) | Adds missing data values to cases in accordance with probabilities specified in a double array which has as many elements as there are columns in the input dataset.
static double[] | center(double[] d) | Centers the values in the given array by subtracting the mean of the array from each element.
static DataSet | center | Subtracts the mean of each column from each datum in that column.
 | center | center.
static Matrix | centerData(Matrix data) | centerData.
static DataSet | concatenate(DataSet... dataSets) | concatenate.
static DataSet | concatenate(DataSet dataSet1, DataSet dataSet2) | concatenate.
static Matrix | concatenate(Matrix... dataSets) | concatenate.
static DataSet | concatenate(List<DataSet> dataSets) | concatenate.
static DataSet | convertNumericalDiscreteToContinuous(DataSet dataSet) | convertNumericalDiscreteToContinuous.
static void | copyColumn(Node node, DataSet source, DataSet dest) | copyColumn.
static ICovarianceMatrix | covarianceNonparanormalDrton(DataSet dataSet) | covarianceNonparanormalDrton.
static DataSet | discretize(DataSet dataSet, int numCategories, boolean variablesCopied) | discretize.
static DataSet | getBootstrapSample(DataSet data, int sampleSize) | getBootstrapSample.
static DataSet | getBootstrapSample(DataSet data, int sampleSize, org.apache.commons.math3.random.RandomGenerator randomGenerator) | Get dataset sampled with replacement.
static Matrix | getBootstrapSample(Matrix data, int sampleSize) | getBootstrapSample.
 | getConstantColumns(DataSet dataSet) | getConstantColumns.
static DataSet | getNonparanormalTransformed(DataSet dataSet) | getNonparanormalTransformed.
static DataSet | getResamplingDataset(DataSet data, int sampleSize) | getResamplingDataset.
static DataSet | getResamplingDataset(DataSet data, int sampleSize, org.apache.commons.math3.random.RandomGenerator randomGenerator) | Get dataset sampled without replacement.
static DataSet | logData | Log or unlog data.
static DataSet | removeConstantColumns(DataSet dataSet) | removeConstantColumns.
static DataSet | removeRandomColumns(DataSet dataSet, double aDouble) | removeRandomColumns.
static DataSet | replaceMissingWithRandom(DataSet inData) | replaceMissingWithRandom.
static DataSet | restrictToMeasured(DataSet fullDataSet) | restrictToMeasured.
static double | scale(double value, double dataMin, double dataMax, double scaleMin, double scaleMax) | Scales a value from one range to another.
static DataSet | scale | Scales the columns of the provided dataset based on the given scale factors.
static DataSet | scale | Scales the continuous variables in the given DataSet to have values in the range [-1, 1].
static void | scale | Scales the values of a specified node in the given dataset to a specified range [scaleMin, scaleMax].
static DataSet | shuffleColumns(DataSet dataModel) | shuffleColumns.
 | shuffleColumns2(List<DataSet> dataSets) | shuffleColumns2.
 | split | split.
static double[] | standardizeData(double[] data) | standardizeData.
static DataSet | standardizeData(DataSet dataSet) | standardizeData.
static Matrix | standardizeData(Matrix data) | standardizeData.
static Matrix | standardizeData(Matrix data, List<Node> variables) | Standardizes the columns of the given data matrix by centering and scaling.
 | standardizeData(List<DataSet> dataSets) | standardizeData.
-
Method Details
-
logData
Log or unlog data.
-
standardizeData
standardizeData.
-
standardizeData
standardizeData.
-
center
center.
-
discretize
discretize.
-
convertNumericalDiscreteToContinuous
public static DataSet convertNumericalDiscreteToContinuous(DataSet dataSet) throws NumberFormatException
convertNumericalDiscreteToContinuous.
- Parameters:
dataSet - a DataSet object
- Returns:
- a DataSet object
- Throws:
NumberFormatException - if any.
-
concatenate
concatenate.
-
concatenate
concatenate.
-
concatenate
concatenate.
-
restrictToMeasured
restrictToMeasured.
-
getResamplingDataset
getResamplingDataset.
- Parameters:
data - a DataSet object
sampleSize - an int
- Returns:
- a sample without replacement with the given sample size from the given dataset.
-
getResamplingDataset
public static DataSet getResamplingDataset(DataSet data, int sampleSize, org.apache.commons.math3.random.RandomGenerator randomGenerator)
Get dataset sampled without replacement.
- Parameters:
data - original dataset
sampleSize - number of rows to sample
randomGenerator - random number generator
- Returns:
- dataset
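Sampling without replacement means that no row of the original data can appear twice in the result. The following is a minimal plain-Java sketch of that idea using a partial Fisher-Yates shuffle of row indices. It is not Tetrad's implementation: the class and method names are hypothetical, a `double[][]` stands in for `DataSet`, `java.util.Random` stands in for the Commons Math `RandomGenerator`, and rejecting a sample size larger than the number of rows is an assumption.

```java
import java.util.Random;

// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class ResampleSketch {
    /** Draws sampleSize distinct rows from data, in random order. */
    static double[][] sampleWithoutReplacement(double[][] data, int sampleSize, Random random) {
        // Assumption: asking for more rows than exist is treated as an error.
        if (sampleSize < 0 || sampleSize > data.length) {
            throw new IllegalArgumentException("sampleSize must be between 0 and the number of rows");
        }
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        double[][] sample = new double[sampleSize][];
        for (int i = 0; i < sampleSize; i++) {
            // Pick one of the not-yet-chosen indices and swap it into position i.
            int j = i + random.nextInt(idx.length - i);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
            sample[i] = data[idx[i]].clone();
        }
        return sample;
    }
}
```

Passing a seeded generator makes the draw reproducible, which is presumably the reason the documented overload accepts a random number generator.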
-
getBootstrapSample
getBootstrapSample.
- Parameters:
data - a DataSet object
sampleSize - an int
- Returns:
- a sample with replacement with the given sample size from the given dataset.
-
getBootstrapSample
public static DataSet getBootstrapSample(DataSet data, int sampleSize, org.apache.commons.math3.random.RandomGenerator randomGenerator)
Get dataset sampled with replacement.
- Parameters:
data - original dataset
sampleSize - number of rows to sample
randomGenerator - random number generator
- Returns:
- dataset
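Sampling with replacement (a bootstrap sample) draws each output row independently, so the same original row may appear several times and the sample may be larger than the original data. A minimal plain-Java sketch of the idea follows. It is not Tetrad's implementation: the class and method names are hypothetical, a `double[][]` stands in for `DataSet`, and `java.util.Random` stands in for the Commons Math `RandomGenerator`.

```java
import java.util.Random;

// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class BootstrapSketch {
    /** Draws sampleSize rows from data uniformly at random, with replacement. */
    static double[][] bootstrapSample(double[][] data, int sampleSize, Random random) {
        double[][] sample = new double[sampleSize][];
        for (int i = 0; i < sampleSize; i++) {
            // Every draw considers all rows, so repeats are possible.
            int row = random.nextInt(data.length);
            sample[i] = data[row].clone();
        }
        return sample;
    }
}
```

Two calls with generators created from the same seed return the same sample, which is useful when resampling results need to be reproduced.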
-
split
split.
-
center
Subtracts the mean of each column from each datum in that column.
-
shuffleColumns
shuffleColumns.
-
shuffleColumns2
shuffleColumns2.
-
covarianceNonparanormalDrton
covarianceNonparanormalDrton.
- Parameters:
dataSet - a DataSet object
- Returns:
- an ICovarianceMatrix object
-
getNonparanormalTransformed
getNonparanormalTransformed.
-
removeConstantColumns
removeConstantColumns.
-
getConstantColumns
getConstantColumns.
-
removeRandomColumns
removeRandomColumns.
-
standardizeData
standardizeData.
-
standardizeData
Standardizes the columns of the given data matrix by centering and scaling. For each column representing a continuous variable, the method calculates the mean and standard deviation, subtracts the mean from each value, and divides by the standard deviation. Discrete variables are ignored.
- Parameters:
data - The input data matrix to be standardized. Each column corresponds to a variable, and each row represents an observation.
variables - A list of nodes representing the variables in the data. The type of each variable (e.g., continuous or discrete) determines whether the variable will be standardized.
- Returns:
- A new standardized data matrix where each continuous variable has been mean-centered and normalized by its standard deviation.
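The column-wise arithmetic described above can be sketched in plain Java as follows. This is an illustration, not Tetrad's code: the class and method names are hypothetical, a `double[][]` stands in for `Matrix`, and a `boolean[]` flag array stands in for the list of nodes. Two further assumptions are made because the documentation does not state them: the sample standard deviation (n - 1 denominator) is used, and a column with zero variance is only centered.

```java
// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class StandardizeSketch {
    /**
     * Returns a copy of data (rows = observations, columns = variables) in which
     * every column flagged as continuous is mean-centered and divided by its
     * standard deviation. Other columns are copied unchanged.
     */
    static double[][] standardizeColumns(double[][] data, boolean[] continuous) {
        int n = data.length;
        int cols = n == 0 ? 0 : data[0].length;
        double[][] out = new double[n][];
        for (int i = 0; i < n; i++) {
            out[i] = data[i].clone();
        }
        for (int j = 0; j < cols; j++) {
            if (!continuous[j]) {
                continue; // discrete columns are ignored
            }
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += data[i][j];
            }
            double mean = sum / n;
            double sumSq = 0.0;
            for (int i = 0; i < n; i++) {
                double diff = data[i][j] - mean;
                sumSq += diff * diff;
            }
            // Assumption: sample standard deviation with an n - 1 denominator.
            double sd = n > 1 ? Math.sqrt(sumSq / (n - 1)) : 0.0;
            for (int i = 0; i < n; i++) {
                double centered = data[i][j] - mean;
                // Assumption: a constant column is centered but not divided by zero.
                out[i][j] = sd > 0.0 ? centered / sd : centered;
            }
        }
        return out;
    }
}
```

For the continuous column {1, 2, 3} the mean is 2 and the sample standard deviation is 1, so the standardized values are {-1, 0, 1}.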
-
standardizeData
public static double[] standardizeData(double[] data)
standardizeData.
- Parameters:
data - an array of double values
- Returns:
- an array of double values
-
center
public static double[] center(double[] d)
Centers the values in the given array by subtracting the mean of the array from each element.
- Parameters:
d - the array of double values to be centered
- Returns:
- a new array where each element is the original value minus the mean of the input array
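The centering step can be written out in a few lines of plain Java. This is a sketch of the arithmetic described above, not Tetrad's source; the class name is hypothetical, and returning an empty array for empty input is an assumption.

```java
// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class CenterSketch {
    /** Returns a new array holding d[i] minus the mean of d; the input is not modified. */
    static double[] center(double[] d) {
        double sum = 0.0;
        for (double v : d) {
            sum += v;
        }
        // Assumption: an empty array has nothing to center.
        double mean = d.length == 0 ? 0.0 : sum / d.length;
        double[] out = new double[d.length];
        for (int i = 0; i < d.length; i++) {
            out[i] = d[i] - mean;
        }
        return out;
    }
}
```

For example, centering {1.0, 2.0, 6.0} (mean 3.0) yields {-2.0, -1.0, 3.0}, and the centered values sum to zero.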
-
centerData
centerData.
-
concatenate
concatenate.
-
getBootstrapSample
getBootstrapSample.
- Parameters:
data - a Matrix object
sampleSize - an int
- Returns:
- a sample with replacement with the given sample size from the given dataset.
-
copyColumn
copyColumn.
-
addMissingData
Adds missing data values to cases in accordance with probabilities specified in a double array which has as many elements as there are columns in the input dataset. Hence, if the first element of the array of probabilities is alpha, then the first column will contain a -99 (or other missing value code) in a given case with probability alpha. This method is useful for generating datasets that can be used to test algorithms that handle missing data and/or latent variables.
Author: Frank Wimberly
- Parameters:
inData - The data to which random missing data is to be added.
probs - The probability of adding missing data to each column.
- Returns:
- The new data set with missing data added.
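The per-column masking described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Tetrad's implementation: the class and method names are hypothetical, a `double[][]` stands in for `DataSet`, the missing-value code is passed in explicitly, and `java.util.Random` supplies the randomness.

```java
import java.util.Random;

// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class MissingDataSketch {
    /**
     * Returns a copy of data in which each cell of column j is replaced by
     * missingCode with probability probs[j]. The input is not modified.
     */
    static double[][] addMissing(double[][] data, double[] probs, double missingCode, Random random) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            for (int j = 0; j < out[i].length; j++) {
                // nextDouble() is uniform on [0, 1), so this test succeeds with probability probs[j].
                if (random.nextDouble() < probs[j]) {
                    out[i][j] = missingCode;
                }
            }
        }
        return out;
    }
}
```

With probabilities {0.0, 1.0} and a missing code of -99, the first column is returned unchanged and every value in the second column becomes -99.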
-
replaceMissingWithRandom
replaceMissingWithRandom.
-
scale
Scales the continuous variables in the given DataSet to have values in the range [-1, 1]. For each continuous column, the method computes the maximum of the absolute values of the minimum and maximum of the column, and divides all values in that column by this maximum value. Discrete columns are not affected.
- Parameters:
dataSet - The DataSet containing variables to be scaled.
scaleMin - The minimum value to scale to.
scaleMax - The maximum value to scale to.
- Returns:
- A new DataSet with scaled continuous variables, while discrete variables remain unchanged.
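Dividing a column by the larger of |min| and |max| is the same as dividing by its largest absolute value, which guarantees results in [-1, 1]. A minimal plain-Java sketch for a single continuous column follows. It is not Tetrad's code: the class and method names are hypothetical, and leaving an all-zero column unchanged is an assumption.

```java
// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class MaxAbsScaleSketch {
    /** Returns a copy of column divided by its largest absolute value. */
    static double[] scaleByMaxAbs(double[] column) {
        double maxAbs = 0.0;
        for (double v : column) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        double[] out = column.clone();
        if (maxAbs == 0.0) {
            return out; // Assumption: an all-zero (or empty) column is returned as is.
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = column[i] / maxAbs;
        }
        return out;
    }
}
```

For the column {-4.0, 1.0, 2.0} the divisor is 4.0, giving {-1.0, 0.25, 0.5}.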
-
scale
Scales the values of a specified node in the given dataset to a specified range [scaleMin, scaleMax]. This method only processes nodes that are instances of ContinuousVariable.
- Parameters:
dataSet - the dataset containing the values to be scaled
scaleMin - the minimum value of the target range
scaleMax - the maximum value of the target range
node - the node corresponding to the column in the dataset to be scaled
-
scale
public static double scale(double value, double dataMin, double dataMax, double scaleMin, double scaleMax)
Scales a value from one range to another.
- Parameters:
value - The value to scale
dataMin - The minimum value of the data range
dataMax - The maximum value of the data range
scaleMin - The minimum value of the scale range
scaleMax - The maximum value of the scale range
- Returns:
- The scaled value
- Throws:
IllegalArgumentException - if dataMin is equal to dataMax
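Mapping a value from one range to another is usually done with a linear interpolation, and the documented exception corresponds to the division by (dataMax - dataMin). The sketch below shows that standard formula in plain Java. The class name is hypothetical, and the exact formula Tetrad uses is an assumption since the documentation does not spell it out.

```java
// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class ScaleSketch {
    /** Linearly maps value from [dataMin, dataMax] onto [scaleMin, scaleMax]. */
    static double scale(double value, double dataMin, double dataMax,
                        double scaleMin, double scaleMax) {
        if (dataMin == dataMax) {
            // A zero-width data range would require dividing by zero.
            throw new IllegalArgumentException("dataMin must not equal dataMax");
        }
        // Position of value within the data range: 0 at dataMin, 1 at dataMax.
        double fraction = (value - dataMin) / (dataMax - dataMin);
        return scaleMin + fraction * (scaleMax - scaleMin);
    }
}
```

Under this formula, a value of 5 in the data range [0, 10] maps to 0 in the target range [-1, 1], and the endpoints 0 and 10 map to -1 and 1.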
-
scale
Scales the columns of the provided dataset based on the given scale factors. Only continuous variables in the dataset are scaled. Discrete variables are ignored. The method returns a new dataset with scaled values, leaving the original dataset unmodified.
- Parameters:
dataSet - the input dataset to be scaled
scales - an array of scale factors, where each scale corresponds to a column in the dataset
- Returns:
- a new dataset with the continuous columns scaled by the given factors
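A per-column scaling of this kind might look like the plain-Java sketch below. The documentation does not say whether values are multiplied or divided by the factors, so multiplication is assumed here and marked as such. The class and method names are hypothetical, a `double[][]` stands in for `DataSet`, and a `boolean[]` marks which columns are continuous.

```java
// Hypothetical illustration; not part of edu.cmu.tetrad.data.
final class ColumnFactorSketch {
    /**
     * Returns a copy of data in which each continuous column j is multiplied
     * by scales[j]. Other columns are copied unchanged; the input is not modified.
     */
    static double[][] scaleColumns(double[][] data, double[] scales, boolean[] continuous) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            for (int j = 0; j < out[i].length; j++) {
                if (continuous[j]) {
                    // Assumption: "scaled by the given factors" means multiplication.
                    out[i][j] = data[i][j] * scales[j];
                }
            }
        }
        return out;
    }
}
```

Cloning each row before writing keeps the original array intact, mirroring the documented guarantee that the input dataset is left unmodified.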
-