Class Discretizer

java.lang.Object
edu.cmu.tetrad.data.Discretizer

public class Discretizer extends Object
Discretizes individual columns of discrete or continuous data. Continuous data is discretized by specifying a list of n - 1 cutoffs for n values in the discretized data, with optional string labels for these values. Discrete data is discretized by specifying a mapping from old value names to new value names, the idea being that old values may be merged.
Author:
josephramsey, Tyler Gibson
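The discrete case in the class description, where old value names are mapped to new ones so that some values merge, can be pictured as a plain map lookup. The following is a minimal standalone sketch of that idea, not Tetrad's implementation: the class `MergeCategoriesSketch`, the method `remap`, and the rule that unmapped values keep their old name are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MergeCategoriesSketch {

    /** Replaces each old value name with its new name (hypothetical helper). */
    static List<String> remap(List<String> column, Map<String, String> oldToNew) {
        List<String> result = new ArrayList<>(column.size());
        for (String value : column) {
            // Assumption of this sketch: values without a mapping keep their old name.
            result.add(oldToNew.getOrDefault(value, value));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> column = List.of("low", "medium", "high", "medium");
        // "low" and "medium" are merged into one new category.
        Map<String, String> oldToNew = Map.of("low", "not high", "medium", "not high");
        System.out.println(remap(column, oldToNew)); // prints [not high, not high, high, not high]
    }
}
```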
  • Constructor Details

    • Discretizer

      public Discretizer(DataSet dataSet)
      Constructs a new discretizer that discretizes every variable as binary, using evenly distributed values.
      Parameters:
      dataSet - the data set to discretize
    • Discretizer

      public Discretizer(DataSet dataSet, Map<Node,DiscretizationSpec> specs)

      Constructs a new discretizer for the given data set, using the given discretization spec for each node in the map.

      Parameters:
      dataSet - the data set to discretize
      specs - a map from each node to its DiscretizationSpec
  • Method Details

    • getEqualFrequencyBreakPoints

      public static double[] getEqualFrequencyBreakPoints(double[] _data, int numberOfCategories)

      Returns breakpoints that divide the given data into the given number of categories of approximately equal frequency.

      Parameters:
      _data - the continuous data values, as an array of double
      numberOfCategories - the number of categories, as an int
      Returns:
      the breakpoints, as an array of double
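      One common way to obtain equal-frequency breakpoints is to sort the data and read off the value at each category boundary. The sketch below illustrates that approach under stated assumptions; it is not the Tetrad implementation and may differ from it in tie handling and boundary choice. The class `EqualFrequencySketch` and its method are hypothetical, and each breakpoint is taken as the largest value of the lower category so that it pairs with a "v <= cutoff" rule.

```java
import java.util.Arrays;

class EqualFrequencySketch {

    /**
     * Returns numberOfCategories - 1 breakpoints that split the data into
     * groups of roughly equal size (hypothetical helper, not Tetrad code).
     */
    static double[] equalFrequencyBreakPoints(double[] data, int numberOfCategories) {
        if (numberOfCategories < 1 || data.length < numberOfCategories) {
            throw new IllegalArgumentException("need at least one value per category");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);

        double[] breakpoints = new double[numberOfCategories - 1];
        for (int i = 1; i < numberOfCategories; i++) {
            // Index of the first sorted element that belongs to category i.
            int firstOfNext = (int) ((long) i * sorted.length / numberOfCategories);
            // The breakpoint is the last value of the category below it.
            breakpoints[i - 1] = sorted[firstOfNext - 1];
        }
        return breakpoints;
    }

    public static void main(String[] args) {
        double[] data = {5.0, 1.0, 4.0, 2.0, 6.0, 3.0};
        // Sorted data is 1..6, so three groups of two end at 2.0 and 4.0.
        System.out.println(Arrays.toString(equalFrequencyBreakPoints(data, 3))); // prints [2.0, 4.0]
    }
}
```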
    • discretize

      public static Discretizer.Discretization discretize(double[] _data, double[] cutoffs, String variableName, List<String> categories)
      Discretizes the continuous data in the given column using the specified cutoffs and category names. The following scheme is used. If cutoffs[i - 1] < v <= cutoffs[i] (where cutoffs[-1] = negative infinity), then v is mapped to category i. If category names are supplied, the discrete column returned will use these category names.
      Parameters:
      cutoffs - The cutoffs used to discretize the data. Should have length c - 1, where c is the number of categories in the discretized data.
      variableName - the name of the returned variable.
      categories - An optional list of category names; may be null. If this is supplied, the discrete column returned will use these category names. If this is non-null, it must have length c, where c is the number of categories for the discretized data. If any category names are null, default category names will be used for those.
      _data - the continuous data to discretize, as an array of double
      Returns:
      The discretized column.
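      The cutoff scheme described above can be sketched in a few lines of standalone Java. This only illustrates the documented rule (v goes to the smallest i with v <= cutoffs[i], or to the last category if v exceeds every cutoff); it is not the library's code. The class `CutoffSketch`, its methods, and the use of the index as the default category name are assumptions. The sketch also assumes ascending cutoffs and does not treat missing values (NaN) specially.

```java
import java.util.Arrays;
import java.util.List;

class CutoffSketch {

    /** Maps each value to a category index using the documented cutoff rule. */
    static int[] toCategoryIndices(double[] data, double[] cutoffs) {
        int[] result = new int[data.length];
        for (int row = 0; row < data.length; row++) {
            int category = cutoffs.length; // above every cutoff: last category
            for (int i = 0; i < cutoffs.length; i++) {
                if (data[row] <= cutoffs[i]) {
                    category = i;
                    break;
                }
            }
            result[row] = category;
        }
        return result;
    }

    /** Returns the supplied name, or a default name when none is given (assumed default). */
    static String categoryName(int index, List<String> categories) {
        if (categories != null && categories.get(index) != null) {
            return categories.get(index);
        }
        return String.valueOf(index);
    }

    public static void main(String[] args) {
        double[] data = {-1.5, 0.0, 0.7, 2.0, 3.1};
        double[] cutoffs = {0.0, 2.0}; // c - 1 = 2 cutoffs give c = 3 categories
        int[] indices = toCategoryIndices(data, cutoffs);
        System.out.println(Arrays.toString(indices)); // prints [0, 0, 1, 1, 2]
        System.out.println(categoryName(indices[4], Arrays.asList("low", "mid", "high"))); // prints high
    }
}
```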
    • equalCounts

      public void equalCounts(Node node, int numCategories)
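      Sets the given node to be discretized into the given number of categories using evenly distributed values, so that each category receives approximately the same number of cases.
      Parameters:
      node - the node (variable) to discretize
      numCategories - the number of categories, as an int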
    • equalIntervals

      public void equalIntervals(Node node, int numCategories)
      Sets the given node to be discretized into the given number of categories using evenly spaced intervals.
      Parameters:
      node - the node (variable) to discretize
      numCategories - the number of categories, as an int
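      Evenly spaced intervals are usually derived by dividing the range between the minimum and maximum of the data into equal-width bins. The sketch below shows that calculation on a plain array; it is an illustration under that assumption, not the Tetrad implementation, and the class `EqualIntervalSketch` and its method are hypothetical names.

```java
import java.util.Arrays;

class EqualIntervalSketch {

    /** Returns numCategories - 1 cutoffs that split [min, max] into equal-width intervals. */
    static double[] equalIntervalCutoffs(double[] data, int numCategories) {
        if (data.length == 0 || numCategories < 1) {
            throw new IllegalArgumentException("need data and at least one category");
        }
        double min = data[0];
        double max = data[0];
        for (double value : data) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }

        double width = (max - min) / numCategories;
        double[] cutoffs = new double[numCategories - 1];
        for (int i = 1; i < numCategories; i++) {
            // The i-th cutoff sits i interval widths above the minimum.
            cutoffs[i - 1] = min + i * width;
        }
        return cutoffs;
    }

    public static void main(String[] args) {
        double[] data = {0.0, 10.0, 2.5, 7.5};
        // Range 0..10 in four intervals of width 2.5.
        System.out.println(Arrays.toString(equalIntervalCutoffs(data, 4))); // prints [2.5, 5.0, 7.5]
    }
}
```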
    • setVariablesCopied

      public void setVariablesCopied(boolean unselectedVariabledCopied)

      Setter for the field variablesCopied.

      Parameters:
      unselectedVariabledCopied - the new value of variablesCopied, as a boolean
    • discretize

      public DataSet discretize()

      Discretizes the variables of the data set according to the specs held by this discretizer.

      Returns:
      The discretized data set.