更新时间:2023-12-02 21:43:46
有趣的问题.感谢您引起我的注意- K-Means ++:优势播种的过程
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
简而言之,聚类中心最初是从一组输入观察向量中随机选择的,如果x
不在先前选择的中心附近,则选择向量x
的可能性就很高.
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x
is high if x
is not near any previously chosen centers.
这里是一维示例.我们的观察值为[0,1,2,3,4].假设第一个中心c1
为0.下一个聚类中心c2
为x的概率与||c1-x||^2
成正比.因此,P(c2 = 1)= 1a,P(c2 = 2)= 4a,P(c2 = 3)= 9a,P(c2 = 4)= 16a,其中a = 1/(1 + 4 + 9 + 16).
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1
, be 0. The probability that the next cluster center, c2
, is x is proportional to ||c1-x||^2
. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
假设c2 = 4.然后,P(c3 = 1)= 1a,P(c3 = 2)= 4a,P(c3 = 3)= 1a,其中a = 1/(1 + 4 + 1).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
我已经用Python编写了初始化过程;我不知道这是否对您有帮助.
I've coded the initialization procedure in Python; I don't know if this helps you.
def initialize(X, K):
C = [X[0]]
for k in range(1, K):
D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
probs = D2/D2.sum()
cumprobs = probs.cumsum()
r = scipy.rand()
for j,p in enumerate(cumprobs):
if r < p:
i = j
break
C.append(X[i])
return C
编辑需要澄清:cumsum
的输出为我们提供了划分区间[0,1]的边界.这些分区的长度等于相应点被选择为中心的概率.因此,由于r
在[0,1]之间统一选择,因此它将恰好落入这些间隔之一(由于break
). for
循环检查以查看r
在哪个分区中.
EDIT with clarification: The output of cumsum
gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r
is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break
). The for
loop checks to see which partition r
is in.
示例:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
# this event has probability 0.1
i = 0
elif r < cumprobs[1]:
# this event has probability 0.2
i = 1
elif r < cumprobs[2]:
# this event has probability 0.3
i = 2
elif r < cumprobs[3]:
# this event has probability 0.4
i = 3