Class EmpiricalDistribution

All Implemented Interfaces:
Serializable, RealDistribution

public class EmpiricalDistribution extends AbstractRealDistribution

Represents an empirical probability distribution -- a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.

An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support the following operations:

  • loading the distribution from a file of observed data values
  • dividing the input data into "bin ranges" and reporting bin frequency counts (data for histogram)
  • reporting univariate statistics describing the full set of data values as well as the observations within each bin
  • generating random values from the distribution

Applications can use EmpiricalDistribution to build grouped frequency histograms representing the input data or to generate random values "like" those in the input file -- i.e., the values generated will follow the distribution of the values in the file.

The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing:

Digesting the input file

  1. Pass through the file once to compute min and max.
  2. Divide the range from min to max into binCount "bins."
  3. Pass through the data file again, computing bin counts and univariate statistics (mean, std dev.) for each of the bins.
  4. Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.
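
The four digest steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the library's code: the class name BinDigestSketch, its fields, and the sum-of-squares variance formula are all hypothetical choices made for the example.

```java
import java.util.Arrays;

/** Illustrative two-pass digest; class and field names are hypothetical. */
final class BinDigestSketch {
    final double min;
    final double max;
    final double delta;                   // bin width
    final long[] counts;                  // observations per bin
    final double[] means;                 // within-bin means
    final double[] stdDevs;               // within-bin sample standard deviations
    final double[] generatorUpperBounds;  // cumulative bin probabilities

    BinDigestSketch(double[] data, int binCount) {
        if (data.length == 0 || binCount <= 0) {
            throw new IllegalArgumentException("need data and a positive bin count");
        }
        // Step 1: the first pass computes min and max.
        min = Arrays.stream(data).min().getAsDouble();
        max = Arrays.stream(data).max().getAsDouble();
        // Step 2: split [min, max] into binCount bins of equal width.
        delta = (max - min) / binCount;
        counts = new long[binCount];
        means = new double[binCount];
        stdDevs = new double[binCount];
        generatorUpperBounds = new double[binCount];

        // Step 3: the second pass accumulates per-bin counts, sums and sums of squares.
        double[] sum = new double[binCount];
        double[] sumSq = new double[binCount];
        for (double x : data) {
            int b = binIndex(x);
            counts[b]++;
            sum[b] += x;
            sumSq[b] += x * x;
        }
        long cumulative = 0;
        for (int b = 0; b < binCount; b++) {
            if (counts[b] > 0) {
                means[b] = sum[b] / counts[b];
            }
            if (counts[b] > 1) {
                double var = (sumSq[b] - counts[b] * means[b] * means[b]) / (counts[b] - 1);
                stdDevs[b] = Math.sqrt(Math.max(var, 0.0)); // guard against rounding below zero
            }
            // Step 4: each bin's subinterval of (0,1) ends at the cumulative relative frequency.
            cumulative += counts[b];
            generatorUpperBounds[b] = (double) cumulative / data.length;
        }
    }

    /** Bins are [min, ub0], (ub0, ub1], ..., so a boundary value falls in the lower bin. */
    int binIndex(double x) {
        if (delta == 0) {
            return 0; // all observations are equal
        }
        int b = (int) Math.ceil((x - min) / delta) - 1;
        return Math.min(Math.max(b, 0), counts.length - 1);
    }
}
```

Because each subinterval's length equals its bin's relative frequency, a uniform draw on (0,1) lands in a bin's subinterval with exactly that bin's empirical probability.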

Generating random values from the distribution

  1. Generate a uniformly distributed value in (0,1).
  2. Select the subinterval to which the value belongs.
  3. Generate a random Gaussian value with mean = mean of the associated bin and std dev = std dev of associated bin.
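
The generation steps can be sketched as follows, assuming the digest has already produced the cumulative subinterval bounds and the per-bin means and standard deviations. BinSamplerSketch is a hypothetical name, and java.util.Random stands in for whatever generator the class is configured with.

```java
import java.util.Random;

/** Illustrative sampler over a digest; names are hypothetical. */
final class BinSamplerSketch {
    private final double[] generatorUpperBounds; // cumulative bin probabilities, last entry 1.0
    private final double[] means;
    private final double[] stdDevs;
    private final Random rng;

    BinSamplerSketch(double[] generatorUpperBounds, double[] means, double[] stdDevs, long seed) {
        this.generatorUpperBounds = generatorUpperBounds.clone();
        this.means = means.clone();
        this.stdDevs = stdDevs.clone();
        this.rng = new Random(seed);
    }

    double next() {
        // Step 1: uniform value in [0, 1).
        double u = rng.nextDouble();
        // Step 2: first subinterval whose upper bound is at least u.
        int b = 0;
        while (b < generatorUpperBounds.length - 1 && u > generatorUpperBounds[b]) {
            b++;
        }
        // Step 3: Gaussian deviate with the selected bin's mean and standard deviation.
        return means[b] + stdDevs[b] * rng.nextGaussian();
    }
}
```

A linear scan keeps the sketch short; with many bins a binary search over the cumulative bounds would find the same subinterval faster.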

EmpiricalDistribution implements the RealDistribution interface as follows. Given x within the range of values in the dataset, let B be the bin containing x and let K be the within-bin kernel for B. Let P(B-) be the sum of the probabilities of the bins below B, let K(B-) be the kernel distribution evaluated at the lower endpoint of B, and let K(B) be the mass of B under K (i.e., the integral of the kernel density over B). Then set P(X <= x) = P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the kernel distribution evaluated at x. This results in a cdf that matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels.

USAGE NOTES:

  • The binCount is set by default to 1000. A good rule of thumb is to set the bin count to approximately the length of the input file divided by 10.
  • The input file must be a plain text file containing one valid numeric entry per line.
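
As a rough illustration of the two usage notes, the sketch below parses one numeric entry per line and applies the divide-by-10 rule of thumb. InputSketch and its methods are hypothetical helpers, not part of the class; in practice the lines could come from java.nio.file.Files.readAllLines.

```java
import java.util.List;

/** Hypothetical helpers illustrating the usage notes. */
final class InputSketch {
    /** Parses one numeric entry per line. */
    static double[] parseLines(List<String> lines) {
        double[] values = new double[lines.size()];
        for (int i = 0; i < values.length; i++) {
            // Double.parseDouble throws NumberFormatException on a blank or non-numeric line.
            values[i] = Double.parseDouble(lines.get(i).trim());
        }
        return values;
    }

    /** Rule of thumb: about one bin per 10 observations, and at least one bin. */
    static int suggestedBinCount(int observationCount) {
        return Math.max(1, observationCount / 10);
    }
}
```

The resulting count can be passed to the EmpiricalDistribution(int binCount) constructor in place of the default.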

  • Field Details

    • DEFAULT_BIN_COUNT

      public static final int DEFAULT_BIN_COUNT
      Default bin count
    • randomData

      protected final RandomDataGenerator randomData
      RandomDataGenerator instance to use in repeated calls to getNextValue()
  • Constructor Details

    • EmpiricalDistribution

      public EmpiricalDistribution()
      Creates a new EmpiricalDistribution with the default bin count.
    • EmpiricalDistribution

      public EmpiricalDistribution(int binCount)
      Creates a new EmpiricalDistribution with the specified bin count.
      Parameters:
      binCount - number of bins. Must be strictly positive.
      Throws:
      MathIllegalArgumentException - if binCount <= 0.
    • EmpiricalDistribution

      public EmpiricalDistribution(int binCount, RandomGenerator generator)
      Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data.
      Parameters:
      binCount - number of bins. Must be strictly positive.
      generator - random data generator (may be null, resulting in default JDK generator)
      Throws:
      MathIllegalArgumentException - if binCount <= 0.
    • EmpiricalDistribution

      public EmpiricalDistribution(RandomGenerator generator)
      Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data.
      Parameters:
      generator - random data generator (may be null, resulting in default JDK generator)
  • Method Details

    • load

      public void load(double[] in) throws NullArgumentException
      Computes the empirical distribution from the provided array of numbers.
      Parameters:
      in - the input data array
      Throws:
      NullArgumentException - if in is null
    • load

      Computes the empirical distribution using data read from a URL.

      The input file must be an ASCII text file containing one valid numeric entry per line.

      Parameters:
      url - url of the input file
      Throws:
      IOException - if an IO error occurs
      NullArgumentException - if url is null
      MathIllegalArgumentException - if URL contains no data
    • load

      public void load(File file) throws IOException, NullArgumentException
      Computes the empirical distribution from the input file.

      The input file must be an ASCII text file containing one valid numeric entry per line.

      Parameters:
      file - the input file
      Throws:
      IOException - if an IO error occurs
      NullArgumentException - if file is null
    • getNextValue

      public double getNextValue() throws MathIllegalStateException
      Generates a random value from this distribution.
      Preconditions:
      • the distribution must be loaded before invoking this method
      Returns:
      the random value.
      Throws:
      MathIllegalStateException - if the distribution has not been loaded
    • getSampleStats

      public StatisticalSummary getSampleStats()
      Returns a StatisticalSummary describing this distribution.
      Preconditions:
      • the distribution must be loaded before invoking this method
      Returns:
      the sample statistics
      Throws:
      IllegalStateException - if the distribution has not been loaded
    • getBinCount

      public int getBinCount()
      Returns the number of bins.
      Returns:
      the number of bins.
    • getBinStats

      public List<StreamingStatistics> getBinStats()
      Returns a List of StreamingStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.
      Returns:
      List of bin statistics.
    • getUpperBounds

      public double[] getUpperBounds()

      Returns a fresh copy of the array of upper bounds for the bins. Bins are:
      [min, upperBounds[0]], (upperBounds[0], upperBounds[1]], ..., (upperBounds[binCount-2], upperBounds[binCount-1] = max].

      Returns:
      array of bin upper bounds
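
The bin layout described for getUpperBounds() can be illustrated with a small sketch. It assumes equal-width bins over [min, max]; BinBoundsSketch and its methods are hypothetical names.

```java
/** Illustrative bin bounds and lookup; names are hypothetical. */
final class BinBoundsSketch {
    /** Upper bounds of binCount equal-width bins over [min, max]. */
    static double[] upperBounds(double min, double max, int binCount) {
        double delta = (max - min) / binCount;
        double[] ub = new double[binCount];
        for (int i = 0; i < binCount - 1; i++) {
            ub[i] = min + delta * (i + 1);
        }
        ub[binCount - 1] = max; // the last bound is pinned to max to avoid rounding drift
        return ub;
    }

    /** Index of the bin (ub[i-1], ub[i]] containing x; out-of-range values clamp to the end bins. */
    static int findBin(double x, double[] ub) {
        for (int i = 0; i < ub.length; i++) {
            if (x <= ub[i]) {
                return i;
            }
        }
        return ub.length - 1;
    }
}
```

Each upper bound is inclusive, so a value equal to upperBounds[i] belongs to bin i rather than bin i+1.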
    • getGeneratorUpperBounds

      public double[] getGeneratorUpperBounds()

      Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.

      Preconditions:
      • the distribution must be loaded before invoking this method
      Returns:
      array of upper bounds of subintervals used in data generation
      Throws:
      NullPointerException - if no load method has been called beforehand.
    • isLoaded

      public boolean isLoaded()
      Property indicating whether or not the distribution has been loaded.
      Returns:
      true if the distribution has been loaded
    • reSeed

      public void reSeed(long seed)
      Reseeds the random number generator used by getNextValue().
      Parameters:
      seed - random generator seed
    • density

      public double density(double x)
      Returns the probability density function (PDF) of this distribution evaluated at the specified point x. In general, the PDF is the derivative of the CDF. If the derivative does not exist at x, then an appropriate replacement should be returned, e.g. Double.POSITIVE_INFINITY, Double.NaN, or the limit inferior or limit superior of the difference quotient.

      Returns the kernel density normalized so that its integral over each bin equals the bin mass.

      Algorithm description:

      1. Find the bin B that x belongs to.
      2. Compute K(B) = the mass of B with respect to the within-bin kernel (i.e., the integral of the kernel density over B).
      3. Return k(x) * P(B) / K(B), where k is the within-bin kernel density and P(B) is the mass of B.
      Parameters:
      x - the point at which the PDF is evaluated
      Returns:
      the value of the probability density function at point x
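
The density algorithm can be sketched for a single bin. To stay within the JDK, this sketch swaps in a logistic kernel for the Gaussian, because the logistic CDF has a closed form; BinDensitySketch and its parameters are hypothetical.

```java
/** Illustrative within-bin density; a logistic kernel stands in for the Gaussian. */
final class BinDensitySketch {
    /** Logistic CDF with location mu and scale s (assumed kernel, not the library's). */
    static double kernelCdf(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /** Logistic density, which equals F(x) * (1 - F(x)) / s. */
    static double kernelPdf(double x, double mu, double s) {
        double c = kernelCdf(x, mu, s);
        return c * (1.0 - c) / s;
    }

    /** Density at x for the bin (lower, upper] that contains x and has mass pB. */
    static double density(double x, double lower, double upper, double pB, double mu, double s) {
        // Step 2: K(B), the kernel mass of the bin.
        double kB = kernelCdf(upper, mu, s) - kernelCdf(lower, mu, s);
        // Step 3: rescale the kernel density so it integrates to pB over the bin.
        return kernelPdf(x, mu, s) * pB / kB;
    }
}
```

Dividing by K(B) discards the part of the kernel that falls outside the bin, which is why the result integrates to the bin mass over the bin.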
    • cumulativeProbability

      public double cumulativeProbability(double x)
      For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x). In other words, this method represents the (cumulative) distribution function (CDF) for this distribution.

      Algorithm description:

      1. Find the bin B that x belongs to.
      2. Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B.
      3. Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B.
      4. Return P(B-) + P(B) * [K(x) - K(B-)] / K(B) where K(x) is the within-bin kernel distribution function evaluated at x.

      If K is a constant distribution, we return P(B-) + P(B) (counting the full mass of B).

      Parameters:
      x - the point at which the CDF is evaluated
      Returns:
      the probability that a random variable with this distribution takes a value less than or equal to x
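
The CDF steps can be sketched for one bin. As an assumption made to keep the example within the JDK, a logistic kernel replaces the Gaussian, and a scale of 0 stands for the constant-kernel case; BinCdfSketch is a hypothetical name.

```java
/** Illustrative within-bin CDF interpolation; a logistic kernel stands in for the Gaussian. */
final class BinCdfSketch {
    /** Logistic CDF with location mu and scale s (assumed kernel, not the library's). */
    static double kernelCdf(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    /**
     * P(X <= x) for x in the bin (lower, upper], where pBelow is P(B-) and pB is P(B).
     * A scale of 0 is used here to represent a constant kernel.
     */
    static double cdf(double x, double lower, double upper,
                      double pBelow, double pB, double mu, double s) {
        if (s == 0.0) {
            return pBelow + pB; // constant kernel: count the full mass of B
        }
        double kBminus = kernelCdf(lower, mu, s);          // K(B-)
        double kB = kernelCdf(upper, mu, s) - kBminus;     // K(B)
        return pBelow + pB * (kernelCdf(x, mu, s) - kBminus) / kB;
    }
}
```

At x = lower the interpolation term is 0 and at x = upper it equals P(B), so the sketch reproduces the grouped frequencies at the bin endpoints.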
    • inverseCumulativeProbability

      public double inverseCumulativeProbability(double p) throws MathIllegalArgumentException
      Computes the quantile function of this distribution. For a random variable X distributed according to this distribution, the returned value is
      • inf{x in R | P(X<=x) >= p} for 0 < p <= 1,
      • inf{x in R | P(X<=x) > 0} for p = 0.
      The default implementation returns
      • getSupportLowerBound() for p = 0,
      • getSupportUpperBound() for p = 1.

      Algorithm description:

      1. Find the smallest i such that the sum of the masses of the bins through i is at least p.
      2. Let B be bin i and let K be the within-bin kernel distribution for B.
        Let K(B) be the mass of B under K.
        Let K(B-) be K evaluated at the lower endpoint of B (the combined mass of the bins below B under K).
        Let P(B) be the probability of bin i.
        Let P(B-) be the sum of the bin masses below bin i.
        Let pCrit = p - P(B-)
      3. Return the inverse of K evaluated at
        K(B-) + pCrit * K(B) / P(B)
      Specified by:
      inverseCumulativeProbability in interface RealDistribution
      Overrides:
      inverseCumulativeProbability in class AbstractRealDistribution
      Parameters:
      p - the cumulative probability
      Returns:
      the smallest p-quantile of this distribution (largest 0-quantile for p = 0)
      Throws:
      MathIllegalArgumentException - if p < 0 or p > 1
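
The quantile steps can be sketched as below. The logistic kernel is again an assumption that stands in for the Gaussian, since its inverse CDF has a closed form; BinQuantileSketch and its parameter layout are hypothetical, and IllegalArgumentException stands in for the library's exception type.

```java
/** Illustrative quantile computation; a logistic kernel stands in for the Gaussian. */
final class BinQuantileSketch {
    static double kernelCdf(double x, double mu, double s) {
        return 1.0 / (1.0 + Math.exp(-(x - mu) / s));
    }

    static double kernelInverseCdf(double q, double mu, double s) {
        return mu + s * Math.log(q / (1.0 - q));
    }

    /** binMass[i], mu[i], s[i] describe bin i, whose upper bound is upperBounds[i]. */
    static double quantile(double p, double min, double[] upperBounds,
                           double[] binMass, double[] mu, double[] s) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must lie in [0, 1]");
        }
        if (p == 0.0) {
            return min;                                  // lower bound of the support
        }
        if (p == 1.0) {
            return upperBounds[upperBounds.length - 1];  // upper bound of the support
        }
        // Step 1: smallest i whose cumulative bin mass is at least p.
        int i = 0;
        double cumulative = binMass[0];
        while (i < binMass.length - 1 && cumulative < p) {
            i++;
            cumulative += binMass[i];
        }
        // Step 2: kernel and bin quantities for bin i.
        double lower = (i == 0) ? min : upperBounds[i - 1];
        double kBminus = kernelCdf(lower, mu[i], s[i]);                  // K(B-)
        double kB = kernelCdf(upperBounds[i], mu[i], s[i]) - kBminus;    // K(B)
        double pBminus = cumulative - binMass[i];                        // P(B-)
        double pCrit = p - pBminus;
        // Step 3: invert the kernel at K(B-) + pCrit * K(B) / P(B).
        return kernelInverseCdf(kBminus + pCrit * kB / binMass[i], mu[i], s[i]);
    }
}
```

Step 3 is the algebraic inverse of the cumulativeProbability formula: solving p = P(B-) + P(B) * [K(x) - K(B-)] / K(B) for K(x) gives K(B-) + pCrit * K(B) / P(B).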
    • getNumericalMean

      public double getNumericalMean()
      Use this method to get the numerical value of the mean of this distribution.
      Returns:
      the mean or Double.NaN if it is not defined
    • getNumericalVariance

      public double getNumericalVariance()
      Use this method to get the numerical value of the variance of this distribution.
      Returns:
      the variance (possibly Double.POSITIVE_INFINITY as for certain cases in TDistribution) or Double.NaN if it is not defined
    • getSupportLowerBound

      public double getSupportLowerBound()
      Access the lower bound of the support. This method must return the same value as inverseCumulativeProbability(0). In other words, this method must return

      inf {x in R | P(X <= x) > 0}.

      Returns:
      lower bound of the support (might be Double.NEGATIVE_INFINITY)
    • getSupportUpperBound

      public double getSupportUpperBound()
      Access the upper bound of the support. This method must return the same value as inverseCumulativeProbability(1). In other words, this method must return

      inf {x in R | P(X <= x) = 1}.

      Returns:
      upper bound of the support (might be Double.POSITIVE_INFINITY)
    • isSupportConnected

      public boolean isSupportConnected()
      Use this method to get information about whether the support is connected, i.e. whether all values between the lower and upper bound of the support are included in the support.
      Returns:
      whether the support is connected or not
    • reseedRandomGenerator

      public void reseedRandomGenerator(long seed)
      Reseed the underlying PRNG.
      Parameters:
      seed - new seed value
    • getKernel

      protected RealDistribution getKernel(StreamingStatistics bStats)
      The within-bin smoothing kernel. Returns a Gaussian distribution parameterized by bStats, unless the bin contains fewer than 2 observations, in which case a constant distribution is returned.
      Parameters:
      bStats - summary statistics for the bin
      Returns:
      within-bin kernel parameterized by bStats
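
The kernel selection rule can be illustrated with a small sampler factory. This is a sketch under assumptions: KernelChoiceSketch is a hypothetical name, the bin statistics are passed as plain numbers rather than a StreamingStatistics instance, and the kernel is represented only by a sampling function.

```java
import java.util.Random;
import java.util.function.DoubleSupplier;

/** Illustrative kernel choice; names are hypothetical. */
final class KernelChoiceSketch {
    /** Returns a sampler for the within-bin kernel of a bin with n observations. */
    static DoubleSupplier kernelSampler(long n, double mean, double stdDev, Random rng) {
        if (n < 2) {
            // Fewer than 2 observations: a standard deviation cannot be estimated,
            // so fall back to a constant distribution at the bin mean.
            return () -> mean;
        }
        // Otherwise: Gaussian with the bin's mean and standard deviation.
        return () -> mean + stdDev * rng.nextGaussian();
    }
}
```

Since getKernel is protected, a subclass could override it to return a different within-bin kernel.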