且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

究竟怎样的k-means ++的工作?

更新时间:2023-12-02 21:48:16

有趣的问题。谢谢你提出这个纸我的注意。原皮 PDF这里。

Interesting question. Thank you for bringing this paper to my attention. PDF link here of the original paper.

在简单来说,聚类中心最初随机选择的一组输入观察矢量,这里选择的概率向量 X 为高,如果 X 不靠近任何previously选择中心。

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

下面是一个一维的例子。我们的观察是[0,1,2,3,4]。让第一中锋, C1 ,为0的概率,未来聚类中心, C2 ,则x是成正比以 || C1-X || ^ 2 。因此,P(C2 = 1)= 1a中,P(C 2 = 2)=4α,P(C2 = 3)= 9a中,P(C 2 = 4)= 16a中,其中a = 1 /(1 + 4 + 9 + 16)。

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

假设C2 = 4。然后,P(C3 = 1)= 1a中,P(C3 = 2)=4α,P(C3 = 3)= 1a中,其中a = 1 /(1 + 4 + 1)。

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

我有codeD Python中的初始化程序;我不知道这是否可以帮助你。

I've coded the initialization procedure in Python; I don't know if this helps you.

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = scipy.rand()
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

编辑与澄清: cumsum 的输出给我们的界限来划分区间[0,1]。这些分区具有长度等于相应点的概率被选择为中心。那么,因为研究均匀之间的选择[0,1],它会下降,因为破发$到这些间隔只有一个( C $ C>)。该循环检查,看看哪个分区研究

EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

例如:

probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3