且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

从至少出现在N个不同集合中的集合列表中查找所有子集

更新时间:2023-02-05 19:22:02

这是一个 itertools 称为迭代1 的第一部分的方法。如果完全可以运行它,那么您可以循环运行它,并在每个阶段删除找到的项目。正如我在评论中指出的那样,仅对小 n 可行:

Here is an itertools approach for the first part of what you call Iteration 1. If it is feasible to run it at all, then you can run it repeatedly in a loop, removing the found items at each stage. As I indicated in the comments, it is only feasible for small n:

import itertools

def intersect(sets):
    #forms the intersection of all s in sets
    #assumes that there is at least one

    I = sets[0].copy()
    for s in sets[1:]:
        I = I & s
    return I

def largestIntersection(sets,n):
    maxSize = 0
    maxSets = [set()]
    for groupCombo in itertools.combinations(sets,n):
        I = intersect(groupCombo)
        if len(I) > maxSize:
            maxSize = len(I)
            maxSets = [I]
        elif len(I) == maxSize:
            maxSets.append(I)
    return maxSets

例如:

>>> groups = [
    [1],
    [1, 2 ,3],
    [1, 2, 3, 4, 5],
    [1, 2, 3 ,4],
    [2, 3],
    [4, 5, 6],
    [4, 5, 7]
]
>>> groups = [set(group) for group in groups]
>>> largestIntersection(groups,3)
[{1, 2, 3}]

编辑时:以下修改可能有帮助-取决于组中数字的分布和组的大小:

On The following modification might help -- depending on distribution of numbers in the groups and the size of the groups:

def guessSize(sets,n,trials):
    return max(len(intersect(random.sample(sets,n))) for trial in range(trials))

def maxIntersection(sets,n,trials = 100000):
    k = guessSize(sets,n,trials)
    filteredSets = [s for s in sets if len(s) >= k]
    return largestIntersection(filteredSets,n)

在尝试迭代相交之前,请减少组的数量。例如:

The idea is to first reduce the number of groups before you try to iterate over the intersections. For example:

#stress test:

import random
nums = list(range(1,101))
testGroups = [set(random.sample(nums,random.randint(1,100))) for n in range(1000)]
foundGroups = maxIntersection(testGroups,3)

计算 foundGroups 只需要几秒钟如果直接使用 largestIntersection(testGroups),则需要几分钟。另一方面,通过选择不同的随机参数,节省的时间可以忽略不计。

it takes only a few seconds to compute foundGroups as opposed to several minutes if I had directly used largestIntersection(testGroups). On the other hand, with different choices of the random parameters the time saving becomes negligible.

在进一步编辑中:也许您可以做得更多进取的过滤:

On Further Perhaps you can be even more aggressive with the filtering:

import collections
def maxIntersection(sets,n,trials = 100000):
    k = guessSize(sets,n,trials)
    filteredSets = [s for s in sets if len(s) >= k]
    counts = collections.Counter()
    for s in filteredSets:
        for x in s:
            counts[x] +=1
    t = {x for x,c in counts.items() if c >= k}
    filteredSets = [s & t for s in filteredSets if len(s & t) >= k]
    return largestIntersection(filteredSets,n)