How to implement k-means with TensorFlow?

Updated: 2023-12-02 22:18:22

(Note: a more polished version of this code is now available as a gist on GitHub.)

You can definitely do it, but you need to define your own stopping criteria (for k-means, that is usually a maximum iteration count plus a check for when the assignments stabilize). Here's an example of how you might do it (there are probably more optimal ways to implement it, and definitely better ways to select the initial points). It's basically what you'd do in numpy if you were trying really hard to avoid doing things iteratively in Python:

import tensorflow as tf
import numpy as np
import time

N=10000
K=4
MAX_ITERS = 1000

start = time.time()

points = tf.Variable(tf.random_uniform([N,2]))
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))

# Silly initialization:  Use the first K points as the starting
# centroids.  In the real world, do this better.
centroids = tf.Variable(tf.slice(points.initialized_value(), [0,0], [K,2]))

# Replicate to N copies of each centroid and K copies of each                    
# point, then subtract and compute the sum of squared distances.                 
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                            reduction_indices=2)

# Use argmin to select the nearest centroid for each point
best_centroids = tf.argmin(sum_squares, 1)
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids,
                                                    cluster_assignments))

def bucket_mean(data, bucket_ids, num_buckets):
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count

means = bucket_mean(points, best_centroids, K)

# Do not write to the assigned clusters variable until after
# computing whether the assignments have changed - hence control_dependencies
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(
        centroids.assign(means),
        cluster_assignments.assign(best_centroids))

sess = tf.Session()
sess.run(tf.initialize_all_variables())

changed = True
iters = 0

while changed and iters < MAX_ITERS:
    iters += 1
    [changed, _] = sess.run([did_assignments_change, do_updates])

[centers, assignments] = sess.run([centroids, cluster_assignments])
end = time.time()
print("Found in %.2f seconds, %d iterations" % (end - start, iters))
print("Centroids:")
print(centers)
print("Cluster assignments:", assignments)
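
For comparison, here is a rough sketch of the same vectorized loop in plain NumPy. It is not part of the original answer: the function name `kmeans_numpy` is made up for illustration, and the choice to leave an empty cluster's centroid where it was (rather than dividing by zero) is an assumption of this sketch.

```python
import numpy as np


def kmeans_numpy(points, k, max_iters=1000):
    """Vectorized k-means on an (N, D) array.

    Returns (centroids, assignments, iterations_run).
    """
    points = np.asarray(points, dtype=np.float64)
    # Same naive initialization as the TensorFlow demo: first k points.
    centroids = points[:k].copy()
    assignments = np.full(len(points), -1, dtype=np.int64)

    iters = 0
    for iters in range(1, max_iters + 1):
        # Broadcast (N, 1, D) against (1, K, D) to get (N, K) squared distances.
        diffs = points[:, None, :] - centroids[None, :, :]
        sum_squares = np.sum(diffs ** 2, axis=2)

        # Nearest centroid for every point.
        best = np.argmin(sum_squares, axis=1)
        if np.array_equal(best, assignments):
            break  # assignments are stable, so the centroids would not move
        assignments = best

        # Per-cluster sums and counts (the analogue of unsorted_segment_sum).
        totals = np.zeros_like(centroids)
        np.add.at(totals, assignments, points)
        counts = np.bincount(assignments, minlength=k)

        # Assumption: an empty cluster keeps its previous centroid.
        nonempty = counts > 0
        centroids[nonempty] = totals[nonempty] / counts[nonempty, None]

    return centroids, assignments, iters
```

Calling `kmeans_numpy(np.random.rand(10000, 2), 4)` mirrors the N=10000, K=4 setup above. Broadcasting replaces the explicit `tile`/`reshape` step, but the memory cost is the same N x K x D intermediate array.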

(Note that a real implementation would need to be more careful about initial cluster selection, avoiding problem cases with all points going to one cluster, etc. This is just a quick demo. I've updated my answer from earlier to make it a bit more clear and "example-worthy".)
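
One common way to be more careful about the initial clusters is k-means++ seeding, which picks each new starting centroid with probability proportional to its squared distance from the centroids chosen so far. The NumPy sketch below is an illustration and not the original answer's method. The name `kmeans_pp_init` and its `seed` parameter are hypothetical.

```python
import numpy as np


def kmeans_pp_init(points, k, seed=None):
    """Pick k starting centroids from an (N, D) array via k-means++ seeding."""
    points = np.asarray(points, dtype=np.float64)
    rng = np.random.default_rng(seed)
    n = len(points)

    # First centroid: a uniformly random data point.
    chosen = [int(rng.integers(n))]
    # Squared distance from each point to its nearest chosen centroid so far.
    min_sq_dist = np.sum((points - points[chosen[0]]) ** 2, axis=1)

    for _ in range(1, k):
        total = min_sq_dist.sum()
        if total == 0.0:
            # Every point coincides with a chosen centroid: fall back to uniform.
            probs = np.full(n, 1.0 / n)
        else:
            probs = min_sq_dist / total
        idx = int(rng.choice(n, p=probs))
        chosen.append(idx)
        new_sq_dist = np.sum((points - points[idx]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, new_sq_dist)

    return points[chosen]
```

The returned (k, D) array could stand in for the "first K points" slice as the initial centroid values. Seeding this way makes it less likely that several starting centroids land in the same cluster, though it does not rule out empty clusters in later iterations.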