public class KolmogorovSmirnovTest extends Object
The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
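For illustration, a minimal one-sample sketch of the statistic and p-value computations described above (the sample values are invented and the import package names are assumed from the usual Hipparchus module layout):

```java
import org.hipparchus.distribution.continuous.NormalDistribution;
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

public class OneSampleKsExample {
    public static void main(String[] args) {
        // Hypothetical sample to be tested against a standard normal distribution.
        double[] data = {0.12, -0.45, 1.02, -1.37, 0.88, 0.03, -0.61, 1.94, -0.27, 0.55};

        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        NormalDistribution normal = new NormalDistribution(0, 1);

        // D_n = sup_x |F_n(x) - F(x)|
        double d = test.kolmogorovSmirnovStatistic(normal, data);
        // p-value for the null hypothesis that the data follow the standard normal distribution
        double p = test.kolmogorovSmirnovTest(normal, data);

        System.out.println("D_n = " + d + ", p-value = " + p);
    }
}
```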
Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t |F_n(t)-F_m(t)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:

- For small samples (where the product of the sample sizes is less than LARGE_SAMPLE_PRODUCT), the method presented in [4] is used to compute the exact p-value for the 2-sample test.
- When the product of the sample sizes exceeds LARGE_SAMPLE_PRODUCT, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.

If the product of the sample sizes is less than LARGE_SAMPLE_PRODUCT and the sample data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[], double[], int, boolean) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.
 
 In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value
 associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \)
 by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean
 strict parameter. This parameter is ignored for large samples.
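A small two-sample sketch tying the pieces above together (invented, tie-free sample values; the import package name is assumed):

```java
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

public class TwoSampleKsExample {
    public static void main(String[] args) {
        // Hypothetical, tie-free samples of sizes n = 7 and m = 8.
        double[] x = {1.1, 2.3, 0.7, 3.2, 1.9, 2.8, 0.4};
        double[] y = {1.5, 2.1, 3.0, 0.9, 2.6, 1.2, 3.4, 0.8};

        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();

        // D_{n,m} = sup_t |F_n(t) - F_m(t)|
        double d = test.kolmogorovSmirnovStatistic(x, y);
        // Default two-sample test: exact for small samples, asymptotic when n * m is large.
        double p = test.kolmogorovSmirnovTest(x, y);

        System.out.println("D_{n,m} = " + d + ", p-value = " + p);
    }
}
```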
 
The methods used by the 2-sample default implementation are also exposed directly:

- exactP(double, int, int, boolean) computes exact 2-sample p-values
- approximateP(double, int, int) uses the asymptotic distribution

The boolean argument of exactP(double, int, int, boolean) allows the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See kolmogorovSmirnovTest(double[], double[], boolean).

References:
[1] Evaluating Kolmogorov's Distribution by George Marsaglia, Wai Wan Tsang, and Jingbo Wang
[2] Computing the Two-Sided Kolmogorov-Smirnov Distribution by Richard Simard and Pierre L'Ecuyer
[3] Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R.' Journal of Statistical Software, 42(7): 1-52.
[4] Wilcox, Rand. 2012. Introduction to Robust Estimation and Hypothesis Testing, Chapter 5, 3rd Ed. Academic Press.

Note that [1] contains an error in computing h; refer to MATH-437 for details.
| Modifier and Type | Field and Description | 
|---|---|
| protected static double | KS_SUM_CAUCHY_CRITERION: Convergence criterion for ksSum(double, double, int) |
| protected static int | LARGE_SAMPLE_PRODUCT: When the product of the sample sizes exceeds this value, the 2-sample K-S test uses the asymptotic distribution to compute the p-value. |
| protected static int | MAXIMUM_PARTIAL_SUM_COUNT: Bound on the number of partial sums in ksSum(double, double, int) |
| protected static double | PG_SUM_RELATIVE_ERROR: Convergence criterion for the sums in pelzGood(double, int) |
| Constructor and Description | 
|---|
| KolmogorovSmirnovTest(): Construct a KolmogorovSmirnovTest instance. |
| KolmogorovSmirnovTest(long seed): Construct a KolmogorovSmirnovTest instance providing a seed for the PRNG used by the bootstrap(double[], double[], int) method. |
| Modifier and Type | Method and Description | 
|---|---|
| double | approximateP(double d, int n, int m): Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. |
| double | bootstrap(double[] x, double[] y, int iterations): Computes bootstrap(x, y, iterations, true). |
| double | bootstrap(double[] x, double[] y, int iterations, boolean strict): Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. |
| double | cdf(double d, int n): Calculates P(D_n < d) using the method described in [1] with quick decisions for extreme values given in [2] (see above). |
| double | cdf(double d, int n, boolean exact): Calculates P(D_n < d) using the method described in [1] with quick decisions for extreme values given in [2] (see above). |
| double | cdfExact(double d, int n): Calculates P(D_n < d). |
| double | exactP(double d, int n, int m, boolean strict): Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. |
| double | kolmogorovSmirnovStatistic(double[] x, double[] y): Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. |
| double | kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data): Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data. |
| double | kolmogorovSmirnovTest(double[] x, double[] y): Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. |
| double | kolmogorovSmirnovTest(double[] x, double[] y, boolean strict): Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. |
| double | kolmogorovSmirnovTest(RealDistribution distribution, double[] data): Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. |
| double | kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact): Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. |
| boolean | kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha): Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. |
| double | ksSum(double t, double tolerance, int maxIterations): Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \), stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. |
| double | pelzGood(double d, int n): Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc. |
protected static final int MAXIMUM_PARTIAL_SUM_COUNT
Bound on the number of partial sums in ksSum(double, double, int).

protected static final double KS_SUM_CAUCHY_CRITERION
Convergence criterion for ksSum(double, double, int).

protected static final double PG_SUM_RELATIVE_ERROR
Convergence criterion for the sums in pelzGood(double, int).

protected static final int LARGE_SAMPLE_PRODUCT
When the product of the sample sizes exceeds this value, the 2-sample K-S test uses the asymptotic distribution to compute the p-value.
public KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance.

public KolmogorovSmirnovTest(long seed)
Construct a KolmogorovSmirnovTest instance providing a seed for the PRNG used by the bootstrap(double[], double[], int) method.
Parameters:
seed - the seed for the PRNG

public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
Parameters:
distribution - reference distribution
data - sample being evaluated
exact - whether or not to force exact computation of the p-value
Returns:
the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
MathIllegalArgumentException - if data does not have length at least 2
NullArgumentException - if data is null
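As a hedged sketch of the exact option (invented sample; assumed package names):

```java
import org.hipparchus.distribution.continuous.NormalDistribution;
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class ExactOneSampleExample {
    static double exactPValue() {
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        // Hypothetical small sample; exact = true routes the computation through cdfExact(double, int).
        double[] data = {0.2, -0.7, 0.47, -0.31, 0.91, 1.4, -0.05};
        return test.kolmogorovSmirnovTest(new NormalDistribution(0, 1), data, true);
    }
}
```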
public double kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
Parameters:
distribution - reference distribution
data - sample being evaluated
Returns:
the Kolmogorov-Smirnov statistic \(D_n\)
Throws:
MathIllegalArgumentException - if data does not have length at least 2
NullArgumentException - if data is null
public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).

- For small samples (where the product of the sample sizes is less than LARGE_SAMPLE_PRODUCT), the exact p-value is computed using the method presented in [4], implemented in exactP(double, int, int, boolean).
- When the product of the sample sizes exceeds LARGE_SAMPLE_PRODUCT, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.

If x.length * y.length < LARGE_SAMPLE_PRODUCT and the combined set of values in x and y contains ties, random jitter is added to x and y to break ties before computing \(D_{n,m}\) and the p-value. The jitter is uniformly distributed on (-minDelta / 2, minDelta / 2) where minDelta is the smallest pairwise difference between values in the combined sample.
If ties are known to be present in the data, bootstrap(double[], double[], int, boolean) may be used as an alternative method for estimating the p-value.
Parameters:
x - first sample dataset
y - second sample dataset
strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples)
Returns:
the p-value associated with the null hypothesis that x and y represent samples from the same distribution
Throws:
MathIllegalArgumentException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
See Also:
bootstrap(double[], double[], int, boolean)
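A brief sketch contrasting the strict and non-strict forms on a small, tie-free pair of samples (values invented; package name assumed):

```java
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class StrictVersusNonStrictExample {
    static void compare() {
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        // Hypothetical small, tie-free samples so the exact path is taken.
        double[] x = {0.5, 1.7, 2.3, 3.1};
        double[] y = {0.9, 1.1, 2.8, 3.6, 4.2};
        double pStrict = test.kolmogorovSmirnovTest(x, y, true);
        double pNonStrict = test.kolmogorovSmirnovTest(x, y, false);
        // For small samples the two differ by the probability mass at the observed D_{n,m}.
        System.out.println(pStrict + " <= " + pNonStrict);
    }
}
```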
public double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
Parameters:
x - first sample dataset
y - second sample dataset
Returns:
the p-value associated with the null hypothesis that x and y represent samples from the same distribution
Throws:
MathIllegalArgumentException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
public double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
Parameters:
x - first sample
y - second sample
Returns:
the test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
Throws:
MathIllegalArgumentException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
distribution - reference distribution
data - sample being evaluated
Returns:
the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
MathIllegalArgumentException - if data does not have length at least 2
NullArgumentException - if data is null
public boolean kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
distribution - reference distribution
data - sample being evaluated
alpha - significance level of the test
Returns:
true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
Throws:
MathIllegalArgumentException - if data does not have length at least 2
NullArgumentException - if data is null
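A minimal sketch of the fixed-significance variant (the 5% level and the reference distribution are illustrative choices; package names assumed):

```java
import org.hipparchus.distribution.continuous.NormalDistribution;
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class FixedSignificanceExample {
    static boolean rejectNormality(double[] sample) {
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        // true means the hypothesis that the sample is standard normal can be rejected at the 5% level.
        return test.kolmogorovSmirnovTest(new NormalDistribution(0, 1), sample, 0.05);
    }
}
```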
public double bootstrap(double[] x, double[] y, int iterations, boolean strict)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R.' Journal of Statistical Software, 42(7): 1-52.
Parameters:
x - first sample
y - second sample
iterations - number of bootstrap resampling iterations
strict - whether or not the null hypothesis is expressed as a strict inequality
Returns:
the estimated p-value
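A sketch of the bootstrap estimate for tied data, using the seeded constructor for reproducibility (sample values, seed, and iteration count are invented; package name assumed):

```java
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class BootstrapExample {
    static double tiedDataPValue() {
        // Seeded constructor so the bootstrap resampling is reproducible.
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest(42L);
        // Hypothetical samples containing ties, for which the jitter-based exact path is less attractive.
        double[] x = {1.0, 2.0, 2.0, 3.0, 5.0, 5.0};
        double[] y = {2.0, 3.0, 3.0, 4.0, 4.0, 6.0};
        // 10000 resampling iterations; strict = true matches R's ks.boot.
        return test.bootstrap(x, y, 10000, true);
    }
}
```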
public double bootstrap(double[] x, double[] y, int iterations)
Computes bootstrap(x, y, iterations, true). This is equivalent to ks.boot(x, y, nboots=iterations) using the R Matching package function. See bootstrap(double[], double[], int, boolean).
Parameters:
x - first sample
y - second sample
iterations - number of bootstrap resampling iterations
Returns:
the estimated p-value
public double cdf(double d, int n) throws MathRuntimeException
Calculates P(D_n < d) using the method described in [1] with quick decisions for extreme values given in [2] (see above). The result is not exact as with cdfExact(double, int) because calculations are based on double rather than BigFraction.
Parameters:
d - statistic
n - sample size
Returns:
P(D_n < d)
Throws:
MathRuntimeException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
public double cdfExact(double d, int n) throws MathRuntimeException
Calculates P(D_n < d). The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
Parameters:
d - statistic
n - sample size
Returns:
P(D_n < d)
Throws:
MathRuntimeException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
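A short sketch comparing the double-precision and extended-precision one-sample cdfs (the statistic value and sample size are arbitrary; package name assumed):

```java
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class CdfComparisonExample {
    static void compare() {
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        // Hypothetical statistic value and sample size.
        double d = 0.3;
        int n = 20;
        double fast = test.cdf(d, n);       // double arithmetic
        double exact = test.cdfExact(d, n); // BigFraction arithmetic, much slower
        System.out.println(Math.abs(fast - exact)); // expected to be tiny
    }
}
```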
public double cdf(double d, int n, boolean exact) throws MathRuntimeException
Calculates P(D_n < d) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
Parameters:
d - statistic
n - sample size
exact - whether the probability should be calculated exactly using BigFraction everywhere, at the expense of very slow execution time, or whether double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
Returns:
P(D_n < d)
Throws:
MathRuntimeException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
public double pelzGood(double d, int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
Parameters:
d - value of d-statistic (x in [2])
n - sample size
Returns:
\(P(D_n < d)\) computed using the Pelz-Good approximation
public double ksSum(double t, double tolerance, int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \), stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations, a MathIllegalStateException is thrown.
Parameters:
t - argument
tolerance - Cauchy criterion for partial sums
maxIterations - maximum number of partial sums to compute
Returns:
the value of the sum described above
Throws:
MathIllegalStateException - if the series does not converge
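A sketch of calling ksSum directly; the tolerance and iteration cap below are arbitrary choices, since the class constants KS_SUM_CAUCHY_CRITERION and MAXIMUM_PARTIAL_SUM_COUNT are protected (package name assumed):

```java
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class KsSumExample {
    static double kolmogorovCdfValue(double t) {
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        // k(t) = 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 t^2); tolerance and cap chosen arbitrarily here.
        return test.ksSum(t, 1.0e-10, 100000);
    }
}
```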
public double exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\). The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] from the class javadoc.
Parameters:
d - D-statistic value
n - first sample size
m - second sample size
strict - whether or not the probability to compute is expressed as a strict inequality
Returns:
\(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\)
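A sketch reproducing what the default two-sample test does, by computing the statistic and then the exact and asymptotic p-values directly (invented, tie-free samples; package name assumed):

```java
import org.hipparchus.stat.inference.KolmogorovSmirnovTest;

class ExactVersusApproximateExample {
    static void compare() {
        KolmogorovSmirnovTest test = new KolmogorovSmirnovTest();
        double[] x = {0.5, 1.7, 2.3, 3.1, 4.4};
        double[] y = {0.9, 1.1, 2.8, 3.6, 4.2, 5.0};
        double d = test.kolmogorovSmirnovStatistic(x, y);
        // Exact strict p-value, as used by the default test for small samples without ties.
        double pExact = test.exactP(d, x.length, y.length, true);
        // Asymptotic approximation, as used when the product of the sample sizes is large.
        double pApprox = test.approximateP(d, x.length, y.length);
        System.out.println(pExact + " vs " + pApprox);
    }
}
```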
public double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined. This implementation passes KS_SUM_CAUCHY_CRITERION as tolerance and MAXIMUM_PARTIAL_SUM_COUNT as maxIterations to ksSum(double, double, int).
Parameters:
d - D-statistic value
n - first sample size
m - second sample size
Returns:
the approximate value of \(P(D_{n,m} > d)\)

Copyright © 2016–2020 Hipparchus.org. All rights reserved.