Two very large lists/collections: how to efficiently detect and/or remove duplicates

Updated: 2023-11-29 18:33:10

HashSet and Where

You should use a HashSet (or Dictionary) for speed:

//Returns an IEnumerable that the caller can chain further or terminate with ToList
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Build a hash-based lookup first, which is much more efficient for searching than a list
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //A ToHashSet call would be nicer where the framework provides one

    int throwAway;
    //Keep only the items of Set that do not appear in Reference
    return Set.Where(y => !ReferenceHashSet.TryGetValue(y, out throwAway));
}

That's the lambda expression version. It uses a Dictionary, which leaves room to store a different value per key if needed. Explicit for-loops could be used instead and might give a small further gain, but compared with two nested loops (one full scan of Reference for every item in Set) a single hash lookup per item is already a very large improvement.
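As a rough sketch of the explicit-loop alternative mentioned above, the version below makes one pass over the input and relies on `HashSet<int>.Add` returning `false` for values that are already present. The class and method names (`LoopDedup`, `RemoveDuplicates`) are made up for illustration and are not part of the original answer; any speed difference compared with the LINQ version is an assumption that would need measuring.

```csharp
using System;
using System.Collections.Generic;

public static class LoopDedup
{
    // Hypothetical helper: returns the items of `set` that do not occur in
    // `reference`, emitting each distinct value only once, in input order.
    public static List<int> RemoveDuplicates(List<int> set, List<int> reference)
    {
        // Seed the hash set with every reference value (duplicates are ignored).
        var seen = new HashSet<int>(reference);
        var result = new List<int>();

        foreach (int item in set)
        {
            // Add returns false when the value is already in the set, so this
            // skips reference values and repeated values in a single check.
            if (seen.Add(item))
            {
                result.Add(item);
            }
        }

        return result;
    }
}
```

For example, calling `LoopDedup.RemoveDuplicates(new List<int> { 1, 2, 2, 3, 4, 4 }, new List<int> { 2, 5 })` yields a list containing 1, 3 and 4. Unlike the LINQ version, this one runs eagerly and returns a materialized list.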

Having learned a few things from the other answers, here is a faster implementation:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => !ReferenceHashSet.Contains(y)).Distinct();
}

Importantly, this approach (while a tiny bit slower than @backs's answer) is still versatile enough to use with database entities, and the field used for the duplicate check can easily be of another type.

Here's an example of how easily the code can be adjusted to work with a list of Person database entities.

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => !ReferenceHashSet.Contains(y.ID))
            .GroupBy(p => p.ID).Select(p => p.First()); //The GroupBy and Select together act as a DistinctBy(p => p.ID)
}
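To illustrate the point about other types and other key fields, here is a minimal generic sketch that takes the key as a selector function. Everything named here (`KeyedDedup`, `ExceptByKey`, and the shape of `Person` with `ID` and `Name`) is an assumption for the example; the original answer does not show the `Person` class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity; the real Person class is not shown in the answer.
public class Person
{
    public int ID { get; set; }
    public string Name { get; set; } = "";
}

public static class KeyedDedup
{
    // Hypothetical helper: yields the items of `set` whose key does not occur
    // in `reference`, keeping only the first item seen for each key.
    public static IEnumerable<T> ExceptByKey<T, TKey>(
        IEnumerable<T> set,
        IEnumerable<T> reference,
        Func<T, TKey> keySelector)
    {
        // Collect the reference keys once so each lookup is a hash probe.
        var seenKeys = new HashSet<TKey>(reference.Select(keySelector));

        foreach (var item in set)
        {
            // Add returns false for keys already present, which covers both
            // reference keys and keys seen earlier in `set`.
            if (seenKeys.Add(keySelector(item)))
            {
                yield return item;
            }
        }
    }
}
```

A call such as `KeyedDedup.ExceptByKey(people, knownPeople, p => p.ID).ToList()` would then play the role of deduplicatePeople, and switching the check to another field only means passing a different selector, for example `p => p.Name`.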
