
且构网 - Sharing the ins and outs of programming and software development


Last updated: 2023-11-29 18:33:10



You must use a HashSet (or Dictionary) for speed:

//Returns an IEnumerable that the caller can chain further or simply terminate with ToList
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hash-based lookup first, which is much more efficient for searching
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //Newer .NET versions also offer ToHashSet(), which would be nicer

    //Keep only the items of Set that are not already present in Reference
    return Set.Where(y => !ReferenceHashSet.ContainsKey(y));
}


That's the lambda expression version. It uses a Dictionary, which leaves room to store a different value per key if needed. Explicit for loops could be used and might yield a small additional performance gain, but compared with two nested loops (on the order of n × m comparisons versus roughly n + m hash operations) this is already a huge improvement.
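For comparison, here is a minimal sketch of the explicit-loop alternative mentioned above. It assumes the goal is to keep the elements of Set that do not appear in Reference; the names `LoopSketch` and `DeduplicateWithLoops` are made up for this illustration.

```csharp
using System.Collections.Generic;

public static class LoopSketch
{
    // Hypothetical helper: keeps the items of 'set' that are absent from 'reference'.
    public static List<int> DeduplicateWithLoops(List<int> set, List<int> reference)
    {
        // One pass over 'reference' builds the lookup
        var referenceLookup = new HashSet<int>(reference);
        var result = new List<int>();

        // One pass over 'set'; each Contains call is a hash lookup, not a scan
        foreach (int item in set)
        {
            if (!referenceLookup.Contains(item))
            {
                result.Add(item);
            }
        }
        return result;
    }
}
```

Unlike the lambda version, this builds the whole result list eagerly instead of returning a lazily evaluated sequence.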


Learning a few things while looking at other answers, here is a faster implementation:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    //Drop everything already in Reference, then collapse duplicates within Set itself
    return Set.Where(y => !ReferenceHashSet.Contains(y)).Distinct();
}
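A quick usage sketch with made-up sample values may help show what the function returns. The wrapper class `UsageSketch` and the method `RunExample` are hypothetical names used only here; the function body repeats the faster implementation so that the block compiles on its own.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class UsageSketch
{
    static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
    {
        var ReferenceHashSet = new HashSet<int>(Reference);
        return Set.Where(y => !ReferenceHashSet.Contains(y)).Distinct();
    }

    // Hypothetical driver with sample data
    public static List<int> RunExample()
    {
        var set = new List<int> { 1, 2, 2, 3, 5, 5 };
        var reference = new List<int> { 2, 4 };

        // 2 is removed because it is in 'reference'; the repeated 5 is collapsed by Distinct
        return deduplicationFunction(set, reference).ToList(); // contains 1, 3, 5
    }
}
```

Because the result is a lazy IEnumerable, the filtering only runs when the caller enumerates it, for example via ToList.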


Importantly, this approach (while a tiny bit slower than @backs's answer) is still versatile enough to use for database entities, and other types can easily be used for the duplicate-check field.


Here's an example of how the code is easily adjusted for use with a list of Person-style database entities.

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset of the reference IDs first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => !ReferenceHashSet.Contains(y.ID))
            .GroupBy(p => p.ID).Select(g => g.First()); //The GroupBy and Select accomplish DistinctBy(p => p.ID), which .NET 6+ provides directly
}
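The same idea can be generalized to any entity and key type. The following is a sketch under assumptions, not part of the original answer: the `Person` class shape (an `ID` and a `Name`), the class `DeduplicationSketch`, and the method `DeduplicateBy` are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity used only for this illustration
public class Person
{
    public int ID { get; set; }
    public string Name { get; set; } = "";
}

public static class DeduplicationSketch
{
    // Yields the items of 'set' whose key is not in 'reference',
    // keeping only the first item seen for each key.
    public static IEnumerable<T> DeduplicateBy<T, TKey>(
        IEnumerable<T> set, IEnumerable<T> reference, Func<T, TKey> keySelector)
    {
        // Seed the set of seen keys with every key from the reference list
        var seenKeys = new HashSet<TKey>(reference.Select(keySelector));

        foreach (T item in set)
        {
            // HashSet.Add returns false when the key is already present, which
            // covers both "exists in reference" and "duplicate within set"
            if (seenKeys.Add(keySelector(item)))
            {
                yield return item;
            }
        }
    }
}
```

A call such as `DeduplicationSketch.DeduplicateBy(incoming, existing, p => p.ID)` would then play the role of deduplicatePeople, and switching the duplicate-check field only means passing a different key selector, for instance `p => p.Name`.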