How do I perform a chi-squared goodness-of-fit test using the scientific libraries in Python?

Let's assume I have some data I obtained empirically:

import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)

It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?

I can fit a model with:

import matplotlib.pyplot as plt
param = stats.expon.fit(x)
grid = np.linspace(0, 100, 10000)
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))

The Kolmogorov-Smirnov test is very elegant to compute:

>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)

However, I can't find a good way of calculating the chi-squared test.

There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (while the exponential distribution is continuous).

The official scipy.stats tutorial only covers a case with a custom distribution, and the probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and degrees of freedom are already obtained.

Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.

An approximate solution for equal probability bins:

  • Estimate the parameters of the distribution
  • Use the inverse cdf, ppf, if it's a scipy.stats distribution, to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args)
  • Then, use np.histogram to count the number of observations in each bin

Then use a chi-square test on the frequencies, as in the sketch below.
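
Putting those steps together, here is a minimal sketch, assuming the exponential data and fit from the question; the choice of n_bins = 20 and the ddof=2 adjustment for the two fitted parameters (loc and scale) are my additions, and the whole thing is only the approximation described above:

import numpy as np
from scipy import stats

size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)

# Step 1: estimate the parameters of the distribution
params = stats.expon.fit(x)

# Step 2: bin edges from the inverse cdf on a regular probability grid,
# with the outer edges clamped to the data range so they stay finite
n_bins = 20
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)
edges[0], edges[-1] = x.min(), x.max()

# Step 3: observed counts per bin; equal-probability bins expect
# size / n_bins observations each
observed, _ = np.histogram(x, bins=edges)
expected = np.full(n_bins, size / n_bins)

# Chi-square test on the frequencies; ddof=2 reduces the degrees of freedom
# from n_bins - 1 to n_bins - 1 - 2 to account for the estimated parameters
statistic, p_value = stats.chisquare(observed, expected, ddof=2)
print(statistic, p_value)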

An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
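
A rough sketch of that variant, again only as an illustration; the rescaling of the expected counts is my addition, so that the observed and expected totals match (recent versions of scipy.stats.chisquare require this):

import numpy as np
from scipy import stats

# Same hypothetical setup as in the previous sketch
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
params = stats.expon.fit(x)
n_bins = 20

# Bin edges at the empirical percentiles of the data
edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
observed, _ = np.histogram(x, bins=edges)

# Actual bin probabilities under the fitted model, via the cdf
expected = size * np.diff(stats.expon.cdf(edges, *params))

# Rescale so the totals agree (the fitted model puts a little probability
# mass beyond x.max(), which these bins do not cover)
expected *= observed.sum() / expected.sum()

statistic, p_value = stats.chisquare(observed, expected, ddof=2)
print(statistic, p_value)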

This is only approximate, since the theory for the chi-square test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of bin edges based on the data affects the asymptotic distribution.

I haven't looked into this in a long time. If an approximate solution is not good enough, then I would recommend asking the question on stats.stackexchange.