且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

层次聚类树状图中的节点索引

更新时间:2022-12-30 18:36:48

我可以从发布的材料中推断出我希望您已经注意到的内容:

  1. 链接矩阵节点按聚类顺序编号.这原始节点为 0-14.[0 3] 成为节点 15,[1 8] 成为节点 16,[6 [1 8]] 是节点 17,依此类推
  2. 树状图节点从根开始编号,深度优先,上(右)分支在左(下)之前,标签从 N-1 到 0.
  3. 左右由链接矩阵中的列决定.

链接矩阵有2N+1行:N+1个原始节点和N个聚类.要重建树状图标签,请从矩阵的最后一行开始,[26 27].这将获得标签 N-1,或 13.在右侧节点(第 1 列)上重复,然后在左侧(第 0 列)上重复.每次分配标签值时递减它.

这足够清楚了吗?

递归深度对您来说是个问题?对于 N 个节点,树深度不应超过 log2(N),对于一百万个节点,树深度不应超过 22-25.您的运行时堆栈无法处理?我的默认是一千.

无论如何,转换为迭代并不那么难.从根节点开始,创建一堆未解析的节点.

stack = [root]虽然堆栈不为空:节点 = stack.pop()stack.push (node.right)stack.push (node.left)process node # 分配节点号等.

这使得维护节点计数器变得更加容易,因为它现在是一个局部变量.还要注意,这是一个基本的图搜索算法:当我们有节点时,抓住列表中的下一个,将它的邻居推到列表上(但检查每个节点是否已经被处理),并处理当前节点."

对于深度优先,使用堆栈;对于广度优先,使用队列.

I'm looking to annotate a hierarchical clustering dendrogram, but I have some trouble associating the node indices produced by scipy.cluster.hierarchy.dendrogram when plotting, to the node indices in the original linkage matrix (e.g. produced with scipy.cluster.hierarchy.linkage).

For instance, say we have the following example (adapted from this SO question),

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
%matplotlib inline

# generate two clusters: a with 10 points, b with 5:
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
# make distances between pairs of children uniform 
# (re-scales the horizontal (distance) axis when plotting)
Z[:,2] = np.arange(Z.shape[0])+1 

def plot_dendrogram(linkage_matrix, **kwargs):

    ddata = dendrogram(linkage_matrix, **kwargs)
    idx = 0
    for i, d, c in zip(ddata['icoord'], ddata['dcoord'], 
                       ddata['color_list']):       
        x = 0.5 * sum(i[1:3])
        y = d[1]
        plt.plot(y, x, 'o', c=c)
        plt.annotate("%.3g" % idx, (y, x), xytext=(15, 5),
                             textcoords='offset points',
                             va='top', ha='center')
        idx += 1
plot_dendrogram(Z, labels=np.arange(X.shape[0]),
                truncate_mode='level', show_leaf_counts=False, 
               orientation='left')

which produces the following dendrogram:

The original X matrix has (X.shape[0] == 15) samples, and the tick labels on the vertical axis corresponds to the sample id for each tree leaf. The number at each node is the id of that node as returned by the dendrogram function. Now if we look at the original linkage matrix, the 1st two columns give the children of each tree node,

print(Z[:,:2].astype('int'))
  [[ 0  3]
   [ 1  8]
   [ 6 16]
   [ 2  5]
   ...
   [22 24]
   [23 25]
   [26 27]]

For instance, the node 0 in the linkage matrix has for children leaves [0, 3], but on the dendrogram above it is labeled as number 9. Similarly the node 1, is labeled as number 4, etc.

I was wondering what would be the simplest way of finding the correspondence between these 2 indices? I looked at the dendrogram function but didn't see any simple way of doing that (particularly if we truncate the dendrogram to some level (with e.g. truncate_mode='level', p=2)...

Note: I'm actually using a linkage matrix given by sklearn.cluster.AgglomerativeClustering but that doesn't really matter for this question (as illustrated in this github issue).

Note2: alternatively if there is a way to compute the list of leaves for every dendrogram node that would also solve my problem.

What I can deduce from the posted material are things I expect you've already noticed:

  1. The linkage matrix nodes are numbered in order of the clustering. The original nodes are 0-14. [0 3] becomes node 15, [1 8] is node 16, [6 [1 8]] is node 17, etc.
  2. The dendogram nodes are numbered from the root, depth-first, upper (right) branch before left (lower), with labels N-1 down to 0.
  3. Right and left are determined by the columns in the linkage matrix.

The linkage matrix has 2N+1 rows: N+1 original nodes and N clusterings. To reconstruct the dendogram labels, start at the last row of the matrix, [26 27]. This gets label N-1, or 13. Recur on the right node (column 1), then the left (column 0). Decrement the label value each time you assign it.

Is that clear enough to follow?


Recursion depth is a problem for you? The tree depth shouldn't be much more than log2(N) for N nodes, or about 22-25 for a million nodes. Your run-time stack can't handle that? Mine defaults to one thousand.

In any case, it's not that hard to convert to iteration. Create a stack of unresolved nodes, starting with the root node.

stack = [root]
while stack is not empty:
    node = stack.pop()
    stack.push (node.right)
    stack.push (node.left)
    process node  # assign node number, etc.

This makes it even easier to maintain the node counter, as it's now a local variable. Also note that this is a basic graph-search algorithm: "While we have nodes, grab the next one on the list, push its neighbors onto the list (but check to see whether each one is already handled), and process the current node."

For depth-first, use a stack; for breadth-first, use a queue.