且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

如何实现K-Means ++算法?

更新时间:2023-12-02 21:43:46

有趣的问题.感谢您引起我的注意- K-Means ++:优势播种的过程

Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

简而言之,聚类中心最初是从一组输入观察向量中随机选择的,如果x不在先前选择的中心附近,则选择向量x的可能性就很高.

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

这里是一维示例.我们的观察值为[0,1,2,3,4].假设第一个中心c1为0.下一个聚类中心c2为x的概率与||c1-x||^2成正比.因此,P(c2 = 1)= 1a,P(c2 = 2)= 4a,P(c2 = 3)= 9a,P(c2 = 4)= 16a,其中a = 1/(1 + 4 + 9 + 16).

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

假设c2 = 4.然后,P(c3 = 1)= 1a,P(c3 = 2)= 4a,P(c3 = 3)= 1a,其中a = 1/(1 + 4 + 1).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

我已经用Python编写了初始化过程;我不知道这是否对您有帮助.

I've coded the initialization procedure in Python; I don't know if this helps you.

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = scipy.rand()
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

编辑需要澄清:cumsum的输出为我们提供了划分区间[0,1]的边界.这些分区的长度等于相应点被选择为中心的概率.因此,由于r在[0,1]之间统一选择,因此它将恰好落入这些间隔之一(由于break). for循环检查以查看r在哪个分区中.

EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

示例:

probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3