Class DataUtils

java.lang.Object
edu.cmu.tetrad.data.DataUtils

public final class DataUtils extends Object
Some static utility methods for dealing with data sets.
Author:
Various folks.
  • Constructor Details

    • DataUtils

      public DataUtils()
  • Method Details

    • copyColumn

      public static void copyColumn(Node node, DataSet source, DataSet dest)
    • isBinary

      public static boolean isBinary(DataSet data, int column)
      States whether the given column of the given data set is binary.
      Parameters:
      data - The data set to check.
      column - The index of the column to check.
      Returns:
      true iff the column is binary.
    • defaultCategory

      public static String defaultCategory(int index)
      Parameters:
      index - One plus the given index.
      Returns:
      the default category for index i. (The default category should ALWAYS be obtained by calling this method.)
    • addMissingData

      public static DataSet addMissingData(DataSet inData, double[] probs)
      Adds missing data values to cases in accordance with probabilities specified in a double array that has as many elements as there are columns in the input dataset. Hence, if the first element of the array of probabilities is alpha, then the first column will contain a -99 (or other missing value code) in a given case with probability alpha. This method is useful for generating datasets that can be used to test algorithms that handle missing data and/or latent variables. Author: Frank Wimberly
      Parameters:
      inData - The data to which random missing data is to be added.
      probs - The probability of adding missing data to each column.
      Returns:
      The new data set with missing data added.
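      The per-column behavior described above can be illustrated with a small stand-alone sketch. It is not Tetrad's implementation: it works on a plain double[][] (rows are cases, columns are variables) instead of a DataSet, and the names MissingDataSketch, addMissing, and MISSING are hypothetical. The value -99 is used only because the description mentions it as one possible missing value code.

```java
import java.util.Random;

// Hypothetical stand-alone sketch; not the Tetrad implementation.
final class MissingDataSketch {
    // Assumed missing-value code for this sketch only.
    static final double MISSING = -99.0;

    // Returns a copy of data in which each cell of column j is replaced
    // by MISSING with probability probs[j]. The input is left unchanged.
    static double[][] addMissing(double[][] data, double[] probs, Random rng) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            if (data[i].length != probs.length) {
                throw new IllegalArgumentException("probs must have one entry per column");
            }
            out[i] = data[i].clone();
            for (int j = 0; j < probs.length; j++) {
                // nextDouble() is uniform on [0, 1), so this fires with probability probs[j].
                if (rng.nextDouble() < probs[j]) {
                    out[i][j] = MISSING;
                }
            }
        }
        return out;
    }
}
```

      With probs = {0.0, 1.0}, the first column is never altered and every cell of the second column becomes the missing value code.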
    • replaceMissingWithRandom

      public static DataSet replaceMissingWithRandom(DataSet inData)
    • discreteSerializableInstance

      public static DataSet discreteSerializableInstance()
      A discrete data set used to construct some other serializable instances.
    • containsMissingValue

      public static boolean containsMissingValue(Matrix data)
      Returns:
      true iff the data set contains a missing value.
    • containsMissingValue

      public static boolean containsMissingValue(DataSet data)
    • logData

      public static DataSet logData(DataSet dataSet, double a, boolean isUnlog, int base)
      Log or unlog data
    • standardizeData

      public static Matrix standardizeData(Matrix data)
    • standardizeData

      public static double[] standardizeData(double[] data)
    • standardizeData

      public static cern.colt.list.DoubleArrayList standardizeData(cern.colt.list.DoubleArrayList data)
    • standardizeData

      public static List<DataSet> standardizeData(List<DataSet> dataSets)
    • standardizeData

      public static DataSet standardizeData(DataSet dataSet)
    • center

      public static double[] center(double[] d)
    • centerData

      public static Matrix centerData(Matrix data)
    • center

      public static List<DataSet> center(List<DataSet> dataList)
    • discretize

      public static DataSet discretize(DataSet dataSet, int numCategories, boolean variablesCopied)
    • createContinuousVariables

      public static List<Node> createContinuousVariables(String[] varNames)
    • subMatrix

      public static Matrix subMatrix(ICovarianceMatrix m, Node x, Node y, List<Node> z)
      Returns:
      the submatrix of m with variables in the order of the x variables.
    • subMatrix

      public static Matrix subMatrix(Matrix m, List<Node> variables, Node x, Node y, List<Node> z)
      Returns:
      the submatrix of m with variables in the order of the x variables.
    • subMatrix

      public static Matrix subMatrix(Matrix m, Map<Node,Integer> indexMap, Node x, Node y, List<Node> z)
      Returns:
      the submatrix of m with variables in the order of the x variables.
    • subMatrix

      public static Matrix subMatrix(ICovarianceMatrix m, Map<Node,Integer> indexMap, Node x, Node y, List<Node> z)
      Returns:
      the submatrix of m with variables in the order of the x variables.
    • convertNumericalDiscreteToContinuous

      public static DataSet convertNumericalDiscreteToContinuous(DataSet dataSet) throws NumberFormatException
      Throws:
      NumberFormatException
    • concatenate

      public static DataSet concatenate(DataSet dataSet1, DataSet dataSet2)
    • concatenate

      public static DataSet concatenate(DataSet... dataSets)
    • concatenate

      public static Matrix concatenate(Matrix... dataSets)
    • concatenate

      public static DataSet concatenate(List<DataSet> dataSets)
    • restrictToMeasured

      public static DataSet restrictToMeasured(DataSet fullDataSet)
    • means

      public static Vector means(Matrix data)
    • means

      public static Vector means(double[][] data)
      Column major data.
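      As an illustration of the column-major layout, the sketch below computes one mean per column from a double[][]. It assumes that "column major" means data[j] holds all values of column j. It returns a plain double[] instead of Tetrad's Vector, and the class name ColumnMajorMeansSketch is hypothetical.

```java
// Hypothetical stand-alone sketch; not the Tetrad implementation.
final class ColumnMajorMeansSketch {
    // Assumption: data[j][i] is the value of column j in row i.
    static double[] means(double[][] data) {
        double[] means = new double[data.length];
        for (int j = 0; j < data.length; j++) {
            double sum = 0.0;
            for (double v : data[j]) {
                sum += v;
            }
            // An empty column yields NaN (0.0 / 0).
            means[j] = sum / data[j].length;
        }
        return means;
    }
}
```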
    • cov

      public static Matrix cov(Matrix data)
    • mean

      public static Vector mean(Matrix data)
    • choleskySimulation

      public static DataSet choleskySimulation(CovarianceMatrix cov)
      Parameters:
      cov - The variables and covariance matrix over the variables.
      Returns:
      The simulated data.
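      The general technique the method name refers to can be sketched as follows: factor the covariance matrix as L L^T with L lower triangular, then multiply independent standard normal draws by L. This is a stand-alone illustration on plain arrays, not Tetrad's code. The names CholeskySimulationSketch, cholesky, and simulate are hypothetical, and the explicit rows and rng parameters are assumptions (the documented method takes only a CovarianceMatrix).

```java
import java.util.Random;

// Hypothetical stand-alone sketch; not the Tetrad implementation.
final class CholeskySimulationSketch {
    // Returns lower-triangular L with L * L^T = cov.
    // cov must be symmetric positive definite.
    static double[][] cholesky(double[][] cov) {
        int n = cov.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = cov[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                if (i == j) {
                    if (sum <= 0.0) {
                        throw new IllegalArgumentException("matrix is not positive definite");
                    }
                    l[i][i] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        return l;
    }

    // Draws each row as x = L z with z i.i.d. standard normal,
    // so that the population covariance of x is L L^T = cov.
    static double[][] simulate(double[][] cov, int rows, Random rng) {
        double[][] l = cholesky(cov);
        int n = cov.length;
        double[][] data = new double[rows][n];
        for (int r = 0; r < rows; r++) {
            double[] z = new double[n];
            for (int k = 0; k < n; k++) {
                z[k] = rng.nextGaussian();
            }
            for (int i = 0; i < n; i++) {
                double x = 0.0;
                for (int k = 0; k <= i; k++) {
                    x += l[i][k] * z[k];
                }
                data[r][i] = x;
            }
        }
        return data;
    }
}
```

      The simulated rows have mean zero; the sample covariance approaches cov only as the number of rows grows.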
    • getBootstrapSample

      public static Matrix getBootstrapSample(Matrix data, int sampleSize)
      Returns:
      a sample with replacement with the given sample size from the given dataset.
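      Sampling rows with replacement can be sketched on a plain double[][] as below. This is an illustration of the technique only, not Tetrad's code: the class BootstrapSketch, the method bootstrapSample, and the explicit java.util.Random parameter are hypothetical.

```java
import java.util.Random;

// Hypothetical stand-alone sketch; not the Tetrad implementation.
final class BootstrapSketch {
    // Draws sampleSize rows uniformly at random, with replacement,
    // so the same source row may appear more than once.
    static double[][] bootstrapSample(double[][] data, int sampleSize, Random rng) {
        double[][] sample = new double[sampleSize][];
        for (int i = 0; i < sampleSize; i++) {
            int row = rng.nextInt(data.length);
            sample[i] = data[row].clone();
        }
        return sample;
    }
}
```

      Because rows are drawn with replacement, sampleSize may exceed the number of rows in the input.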
    • getResamplingDataset

      public static DataSet getResamplingDataset(DataSet data, int sampleSize)
      Returns:
      a sample without replacement with the given sample size from the given dataset.
    • getResamplingDataset

      public static DataSet getResamplingDataset(DataSet data, int sampleSize, org.apache.commons.math3.random.RandomGenerator randomGenerator)
      Get dataset sampled without replacement.
      Parameters:
      data - original dataset
      sampleSize - number of rows to sample
      randomGenerator - random number generator
      Returns:
      dataset
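      Sampling rows without replacement can be sketched with a partial Fisher-Yates shuffle of the row indices. This is a stand-alone illustration on a plain double[][], not Tetrad's code. The names ResamplingSketch and sampleWithoutReplacement are hypothetical, and java.util.Random stands in for the Commons Math RandomGenerator so the example needs no dependencies.

```java
import java.util.Random;

// Hypothetical stand-alone sketch; not the Tetrad implementation.
final class ResamplingSketch {
    // Draws sampleSize distinct rows uniformly at random.
    static double[][] sampleWithoutReplacement(double[][] data, int sampleSize, Random rng) {
        if (sampleSize > data.length) {
            throw new IllegalArgumentException("sampleSize exceeds the number of rows");
        }
        int[] idx = new int[data.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        double[][] sample = new double[sampleSize][];
        for (int i = 0; i < sampleSize; i++) {
            // Swap a random not-yet-chosen index into position i.
            int j = i + rng.nextInt(idx.length - i);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
            sample[i] = data[idx[i]].clone();
        }
        return sample;
    }
}
```

      Unlike the bootstrap case, no source row can appear twice, so sampleSize cannot exceed the number of rows.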
    • getBootstrapSample

      public static DataSet getBootstrapSample(DataSet data, int sampleSize)
      Returns:
      a sample with replacement with the given sample size from the given dataset.
    • getBootstrapSample

      public static DataSet getBootstrapSample(DataSet data, int sampleSize, org.apache.commons.math3.random.RandomGenerator randomGenerator)
      Get dataset sampled with replacement.
      Parameters:
      data - original dataset
      sampleSize - number of rows to sample
      randomGenerator - random number generator
      Returns:
      dataset
    • split

      public static List<DataSet> split(DataSet data, double percentTest)
    • center

      public static DataSet center(DataSet data)
      Subtracts the mean of each column from each datum in that column.
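      Column centering can be sketched on a plain double[][] as below. This is an illustration only, not Tetrad's code: the class CenterSketch is hypothetical, and the sketch assumes a rectangular array with rows as cases and no missing values.

```java
// Hypothetical stand-alone sketch; not the Tetrad implementation.
final class CenterSketch {
    // data[i][j] is the value in row i, column j.
    // Returns a new array; the input is left unchanged.
    static double[][] center(double[][] data) {
        int rows = data.length;
        if (rows == 0) {
            return new double[0][];
        }
        int cols = data[0].length;
        double[] mean = new double[cols];
        for (double[] row : data) {
            for (int j = 0; j < cols; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < cols; j++) {
            mean[j] /= rows;
        }
        double[][] out = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                out[i][j] = data[i][j] - mean[j];
            }
        }
        return out;
    }
}
```

      After centering, every column of the result sums to zero (up to floating-point rounding).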
    • shuffleColumns

      public static DataSet shuffleColumns(DataSet dataModel)
    • shuffleColumns2

      public static List<DataSet> shuffleColumns2(List<DataSet> dataSets)
    • covarianceNonparanormalDrton

      public static ICovarianceMatrix covarianceNonparanormalDrton(DataSet dataSet)
    • getNonparanormalTransformed

      public static DataSet getNonparanormalTransformed(DataSet dataSet)
    • removeConstantColumns

      public static DataSet removeConstantColumns(DataSet dataSet)
    • getEss

      public static double getEss(ICovarianceMatrix covariances)
      Returns the equivalent sample size, assuming all units are equally correlated and all unit variances are equal.
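      This page does not give the formula. Under the stated assumptions (n units with equal variance and a common pairwise correlation rho), a standard result is that the variance of the sample mean is inflated by the factor 1 + (n - 1) * rho, which gives an equivalent sample size of n / (1 + (n - 1) * rho). The sketch below computes that quantity. Whether getEss uses exactly this expression, and how it derives rho from the covariance matrix, are not stated here. The class EssSketch and its parameters are hypothetical.

```java
// Hypothetical stand-alone sketch of the textbook equicorrelation formula;
// not confirmed to be what Tetrad's getEss computes.
final class EssSketch {
    // n: number of units; rho: assumed common pairwise correlation.
    static double equivalentSampleSize(int n, double rho) {
        double inflation = 1.0 + (n - 1) * rho;
        if (n < 1 || inflation <= 0.0) {
            throw new IllegalArgumentException("need n >= 1 and rho > -1/(n-1)");
        }
        return n / inflation;
    }
}
```

      With rho = 0 the result is n (independent units), and with rho = 1 it is 1 (all units carry the same information).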