Class KolmogorovSmirnovTest
- java.lang.Object
-
- org.hipparchus.stat.inference.KolmogorovSmirnovTest
-
public class KolmogorovSmirnovTest extends Object
Implementation of the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
Two-sample tests are also supported, evaluating the null hypothesis that the two samples
x
andy
come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t | F_n(t)-F_m(t)|\) where \(n\) is the length ofx
, \(m\) is the length ofy
, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values inx
and \(F_m\) is the empirical distribution of they
values. The default 2-sample test method,kolmogorovSmirnovTest(double[], double[])
works as follows:- For small samples (where the product of the sample sizes is less than
LARGE_SAMPLE_PRODUCT
), the method presented in [4] is used to compute the exact p-value for the 2-sample test. - When the product of the sample sizes exceeds
LARGE_SAMPLE_PRODUCT
, the asymptotic distribution of \(D_{n,m}\) is used. SeeapproximateP(double, int, int)
for details on the approximation.
If the product of the sample sizes is less than
LARGE_SAMPLE_PRODUCT
and the sample data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, thebootstrap(double[], double[], int, boolean)
method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean
strict
parameter. This parameter is ignored for large samples.The methods used by the 2-sample default implementation are also exposed directly:
exactP(double, int, int, boolean)
computes exact 2-sample p-valuesapproximateP(double, int, int)
uses the asymptotic distribution Theboolean
arguments in the first two methods allow the probability used to estimate the p-value to be expressed using strict or non-strict inequality. SeekolmogorovSmirnovTest(double[], double[], boolean)
.
References:
- [1] Evaluating Kolmogorov's Distribution by George Marsaglia, Wai Wan Tsang, and Jingbo Wang
- [2] Computing the Two-Sided Kolmogorov-Smirnov Distribution by Richard Simard and Pierre L'Ecuyer
- [3] Jasjeet S. Sekhon. 2011. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R Journal of Statistical Software, 42(7): 1-52.
- [4] Kim, P. J. and Jennrich, R. I. (1970). Tables of the Exact Sampling Distribution of the Two-Sample Kolmogorov-Smirnov Criterion D_mn ,m≦n in Selected Tables in Mathematical Statistics, Vol. 1, H. L. Harter and D. B. Owen, editors.
Note that [1] contains an error in computing h, refer to MATH-437 for details.
- For small samples (where the product of the sample sizes is less than
-
-
Field Summary
Fields Modifier and Type Field Description protected static double
KS_SUM_CAUCHY_CRITERION
Convergence criterion forksSum(double, double, int)
protected static int
LARGE_SAMPLE_PRODUCT
When product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution to compute the p-value.protected static int
MAXIMUM_PARTIAL_SUM_COUNT
Bound on the number of partial sums inksSum(double, double, int)
protected static double
PG_SUM_RELATIVE_ERROR
Convergence criterion for the sums in #pelzGood(double, double, int)}
-
Constructor Summary
Constructors Constructor Description KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance.KolmogorovSmirnovTest(long seed)
Construct a KolmogorovSmirnovTest instance providing a seed for the PRNG used by thebootstrap(double[], double[], int)
method.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description double
approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.double
bootstrap(double[] x, double[] y, int iterations)
Computesbootstrap(x, y, iterations, true)
.double
bootstrap(double[] x, double[] y, int iterations, boolean strict)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatx
andy
are samples drawn from the same probability distribution.double
cdf(double d, int n)
CalculatesP(D_n < d)
using the method described in [1] with quick decisions for extreme values given in [2] (see above).double
cdf(double d, int n, boolean exact)
CalculatesP(D_n < d)
using method described in [1] with quick decisions for extreme values given in [2] (see above).double
cdfExact(double d, int n)
CalculatesP(D_n < d)
.double
exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) ifstrict
istrue
; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.double
kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length ofx
, \(m\) is the length ofy
, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values inx
and \(F_m\) is the empirical distribution of they
values.double
kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated withdistribution
, \(n\) is the length ofdata
and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values indata
.double
kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatx
andy
are samples drawn from the same probability distribution.double
kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatx
andy
are samples drawn from the same probability distribution.double
kolmogorovSmirnovTest(RealDistribution distribution, double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatdata
conforms todistribution
.double
kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatdata
conforms todistribution
.boolean
kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis thatdata
conforms todistribution
.double
ksSum(double t, double tolerance, int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are withintolerance
of one another, or whenmaxIterations
partial sums have been computed.double
pelzGood(double d, int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
-
-
-
Field Detail
-
MAXIMUM_PARTIAL_SUM_COUNT
protected static final int MAXIMUM_PARTIAL_SUM_COUNT
Bound on the number of partial sums inksSum(double, double, int)
- See Also:
- Constant Field Values
-
KS_SUM_CAUCHY_CRITERION
protected static final double KS_SUM_CAUCHY_CRITERION
Convergence criterion forksSum(double, double, int)
- See Also:
- Constant Field Values
-
PG_SUM_RELATIVE_ERROR
protected static final double PG_SUM_RELATIVE_ERROR
Convergence criterion for the sums in #pelzGood(double, double, int)}- See Also:
- Constant Field Values
-
LARGE_SAMPLE_PRODUCT
protected static final int LARGE_SAMPLE_PRODUCT
When product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic distribution to compute the p-value.- See Also:
- Constant Field Values
-
-
Constructor Detail
-
KolmogorovSmirnovTest
public KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance.
-
KolmogorovSmirnovTest
public KolmogorovSmirnovTest(long seed)
Construct a KolmogorovSmirnovTest instance providing a seed for the PRNG used by thebootstrap(double[], double[], int)
method.- Parameters:
seed
- the seed for the PRNG
-
-
Method Detail
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatdata
conforms todistribution
. Ifexact
is true, the distribution used to compute the p-value is computed using extended precision. SeecdfExact(double, int)
.- Parameters:
distribution
- reference distributiondata
- sample being being evaluatedexact
- whether or not to force exact computation of the p-value- Returns:
- the p-value associated with the null hypothesis that
data
is a sample fromdistribution
- Throws:
MathIllegalArgumentException
- ifdata
does not have length at least 2NullArgumentException
- ifdata
is null
-
kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated withdistribution
, \(n\) is the length ofdata
and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values indata
.- Parameters:
distribution
- reference distributiondata
- sample being evaluated- Returns:
- Kolmogorov-Smirnov statistic \(D_n\)
- Throws:
MathIllegalArgumentException
- ifdata
does not have length at least 2NullArgumentException
- ifdata
is null
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatx
andy
are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that thekolmogorovSmirnovStatistic(double[], double[])
associated with a randomly selected partition of the combined sample into subsamples of sizesx.length
andy.length
will strictly exceed (ifstrict
istrue
) or be at least as large asstrict = false
) askolmogorovSmirnovStatistic(x, y)
.- For small samples (where the product of the sample sizes is less than
LARGE_SAMPLE_PRODUCT
), the exact p-value is computed using the method presented in [4], implemented inexactP(double, int, int, boolean)
. - When the product of the sample sizes exceeds
LARGE_SAMPLE_PRODUCT
, the asymptotic distribution of \(D_{n,m}\) is used. SeeapproximateP(double, int, int)
for details on the approximation.
If
x.length * y.length
<LARGE_SAMPLE_PRODUCT
and the combined set of values inx
andy
contains ties, random jitter is added tox
andy
to break ties before computing \(D_{n,m}\) and the p-value. The jitter is uniformly distributed on (-minDelta / 2, minDelta / 2) where minDelta is the smallest pairwise difference between values in the combined sample.If ties are known to be present in the data,
bootstrap(double[], double[], int, boolean)
may be used as an alternative method for estimating the p-value.- Parameters:
x
- first sample datasety
- second sample datasetstrict
- whether or not the probability to compute is expressed as a strict inequality (ignored for large samples)- Returns:
- p-value associated with the null hypothesis that
x
andy
represent samples from the same distribution - Throws:
MathIllegalArgumentException
- if eitherx
ory
does not have length at least 2NullArgumentException
- if eitherx
ory
is null- See Also:
bootstrap(double[], double[], int, boolean)
- For small samples (where the product of the sample sizes is less than
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatx
andy
are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. SeekolmogorovSmirnovTest(RealDistribution, double[], boolean)
.- Parameters:
x
- first sample datasety
- second sample dataset- Returns:
- p-value associated with the null hypothesis that
x
andy
represent samples from the same distribution - Throws:
MathIllegalArgumentException
- if eitherx
ory
does not have length at least 2NullArgumentException
- if eitherx
ory
is null
-
kolmogorovSmirnovStatistic
public double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length ofx
, \(m\) is the length ofy
, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values inx
and \(F_m\) is the empirical distribution of they
values.- Parameters:
x
- first sampley
- second sample- Returns:
- test statistic \(D_{n,m}\) used to evaluate the null hypothesis that
x
andy
represent samples from the same underlying distribution - Throws:
MathIllegalArgumentException
- if eitherx
ory
does not have length at least 2NullArgumentException
- if eitherx
ory
is null
-
kolmogorovSmirnovTest
public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatdata
conforms todistribution
.- Parameters:
distribution
- reference distributiondata
- sample being being evaluated- Returns:
- the p-value associated with the null hypothesis that
data
is a sample fromdistribution
- Throws:
MathIllegalArgumentException
- ifdata
does not have length at least 2NullArgumentException
- ifdata
is null
-
kolmogorovSmirnovTest
public boolean kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis thatdata
conforms todistribution
.- Parameters:
distribution
- reference distributiondata
- sample being being evaluatedalpha
- significance level of the test- Returns:
- true iff the null hypothesis that
data
is a sample fromdistribution
can be rejected with confidence 1 -alpha
- Throws:
MathIllegalArgumentException
- ifdata
does not have length at least 2NullArgumentException
- ifdata
is null
-
bootstrap
public double bootstrap(double[] x, double[] y, int iterations, boolean strict)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis thatx
andy
are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of sizex.length
andy.length
from the empirical distribution of the combined sample. Whenstrict
is true, this is equivalent to the algorithm implemented in the R functionks.boot
, described inJasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
- Parameters:
x
- first sampley
- second sampleiterations
- number of bootstrap resampling iterationsstrict
- whether or not the null hypothesis is expressed as a strict inequality- Returns:
- estimated p-value
-
bootstrap
public double bootstrap(double[] x, double[] y, int iterations)
Computesbootstrap(x, y, iterations, true)
. This is equivalent to ks.boot(x,y, nboots=iterations) using the R Matching package function. See #bootstrap(double[], double[], int, boolean).- Parameters:
x
- first sampley
- second sampleiterations
- number of bootstrap resampling iterations- Returns:
- estimated p-value
-
cdf
public double cdf(double d, int n) throws MathRuntimeException
CalculatesP(D_n < d)
using the method described in [1] with quick decisions for extreme values given in [2] (see above). The result is not exact as withcdfExact(double, int)
because calculations are based ondouble
rather thanBigFraction
.- Parameters:
d
- statisticn
- sample size- Returns:
- \(P(D_n < d)\)
- Throws:
MathRuntimeException
- if algorithm fails to converth
to aBigFraction
in expressingd
as \((k - h) / m\) for integerk, m
and \(0 <= h < 1\)
-
cdfExact
public double cdfExact(double d, int n) throws MathRuntimeException
CalculatesP(D_n < d)
. The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choosecdf(double, int)
. See the class javadoc for definitions and algorithm description.- Parameters:
d
- statisticn
- sample size- Returns:
- \(P(D_n < d)\)
- Throws:
MathRuntimeException
- if the algorithm fails to converth
to aBigFraction
in expressingd
as \((k - h) / m\) for integerk, m
and \(0 <= h < 1\)
-
cdf
public double cdf(double d, int n, boolean exact) throws MathRuntimeException
CalculatesP(D_n < d)
using method described in [1] with quick decisions for extreme values given in [2] (see above).- Parameters:
d
- statisticn
- sample sizeexact
- whether the probability should be calculated exact usingBigFraction
everywhere at the expense of very slow execution time, or ifdouble
should be used convenient places to gain speed. Almost never choosetrue
in real applications unless you are very sure;true
is almost solely for verification purposes.- Returns:
- \(P(D_n < d)\)
- Throws:
MathRuntimeException
- if algorithm fails to converth
to aBigFraction
in expressingd
as \((k - h) / m\) for integerk, m
and \(0 \lt;= h < 1\).
-
pelzGood
public double pelzGood(double d, int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.- Parameters:
d
- value of d-statistic (x in [2])n
- sample size- Returns:
- \(P(D_n < d)\)
-
ksSum
public double ksSum(double t, double tolerance, int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are withintolerance
of one another, or whenmaxIterations
partial sums have been computed. If the sum does not converge beforemaxIterations
iterations aMathIllegalStateException
is thrown.- Parameters:
t
- argumenttolerance
- Cauchy criterion for partial sumsmaxIterations
- maximum number of partial sums to compute- Returns:
- Kolmogorov sum evaluated at t
- Throws:
MathIllegalStateException
- if the series does not converge
-
exactP
public double exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) ifstrict
istrue
; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. SeekolmogorovSmirnovStatistic(double[], double[])
for the definition of \(D_{n,m}\).The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] from the class javadoc.
- Parameters:
d
- D-statistic valuen
- first sample sizem
- second sample sizestrict
- whether or not the probability to compute is expressed as a strict inequality- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\)
greater than (resp. greater than or equal to)
d
-
approximateP
public double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. SeekolmogorovSmirnovStatistic(double[], double[])
for the definition of \(D_{n,m}\).Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See
ksSum(double, double, int)
for details on how convergence of the sum is determined. This implementation passesksSum
KS_SUM_CAUCHY_CRITERION
astolerance
andMAXIMUM_PARTIAL_SUM_COUNT
asmaxIterations
.- Parameters:
d
- D-statistic valuen
- first sample sizem
- second sample size- Returns:
- approximate probability that a randomly selected m-n partition of m + n generates
\(D_{n,m}\) greater than
d
-
-