A function to remove duplicates from a list of tuples in Python

Updated: 2023-12-04 21:37:04

Rather than try to figure out what your code is trying to do and fix it, let's go back to your English description:


"In English, what I am attempting to do with dupCatch() is take the data from sqlPull(), initialize an empty list, and say: for all of the tuples in the variable data, if that tuple is not in the empty list, add it to the newData variable; if not, set lastPull equal to the non-unique tuples."

So:

seen = set()  # every row seen so far, across all calls
def dupCatch():
    data = sqlPull()
    new_data = []
    for (TimeStamp, MAC, RSSI) in data:
        if (TimeStamp, MAC, RSSI) not in seen:
            # First time we've seen this row: remember it and keep it.
            seen.add((TimeStamp, MAC, RSSI))
            new_data.append((TimeStamp, MAC, RSSI))
    print(new_data)

Or, more concisely:

seen = set()
def dupCatch():
    data = sqlPull()
    # Keep only the rows that no earlier call has returned.
    new_data = [row for row in data if row not in seen]
    seen.update(new_data)
    print(new_data)

Either way, the trick here is that we have a set which keeps track of every row we've ever seen. So, for each new row, if it's in that set, we've seen it and can ignore it; otherwise, it's new, so we keep it and add it to the set for later.

The second version just simplifies things by filtering all 5 rows at once, and then updating the set with all of the new ones in a single seen.update() call, instead of doing it row by row.

The reason that seen has to be global is that a global lives forever, across all runs of the function, so we can use it to keep track of every row we've ever seen; if we made it local to the function, it would be new each time, so we'd only be keeping track of rows we've seen in the current batch, which isn't very useful.
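
To make the difference concrete, here is a small sketch comparing a module-level set with one created inside the function. The sql_pull() stub, the sample rows, and the names dup_catch_global and dup_catch_local are made up for illustration; they stand in for the real sqlPull() and dupCatch().

```python
# Hypothetical sample data: two batches that share one row.
BATCHES = [
    [("10:00", "aa:bb", -40), ("10:01", "cc:dd", -55)],
    [("10:01", "cc:dd", -55), ("10:02", "ee:ff", -60)],
]
_batches = iter(BATCHES)


def sql_pull():
    """Stand-in for sqlPull(): each call returns the next batch of rows."""
    return next(_batches)


seen = set()  # module-level, so it survives between calls


def dup_catch_global():
    new_data = [row for row in sql_pull() if row not in seen]
    seen.update(new_data)
    return new_data


def dup_catch_local(batch):
    seen_local = set()  # rebuilt on every call, so earlier batches are forgotten
    new_data = [row for row in batch if row not in seen_local]
    seen_local.update(new_data)
    return new_data


first = dup_catch_global()   # both rows are new
second = dup_catch_global()  # only the row that was not in the first batch
forgetful = dup_catch_local(BATCHES[1])  # the repeated row slips through
```

With the global set, the second call drops the row that already appeared in the first batch; the local version returns the whole second batch because it has no memory of the first.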

In general, globals are bad. However, things like persistent caches are an exception to the "in general" rule. The whole point of them is that they're not local. If you had an object model in mind that made sense, seen would be much better as a member of whatever object dupCatch was a method on than as a global. If you had a good reason to define the function as a closure inside another function, seen would be better as part of that closure. And so on. But otherwise, a global is the best option.
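
For comparison, the two alternatives mentioned here might look roughly like the sketch below. The names make_dup_catch and DupCatcher, the pull parameter, and the sample rows are assumptions made for illustration; nothing in the question defines them.

```python
def make_dup_catch(pull):
    """Closure version: `seen` lives in the enclosing scope, not in a global."""
    seen = set()

    def dup_catch():
        new_data = [row for row in pull() if row not in seen]
        seen.update(new_data)
        return new_data

    return dup_catch


class DupCatcher:
    """Object version: `seen` is an attribute of the instance."""

    def __init__(self, pull):
        self.pull = pull
        self.seen = set()

    def dup_catch(self):
        new_data = [row for row in self.pull() if row not in self.seen]
        self.seen.update(new_data)
        return new_data


def fake_pull_factory():
    """Hypothetical stand-in for sqlPull(): returns two overlapping batches."""
    batches = iter([
        [("10:00", "aa:bb", -40)],
        [("10:00", "aa:bb", -40), ("10:01", "cc:dd", -55)],
    ])
    return lambda: next(batches)


dup_catch = make_dup_catch(fake_pull_factory())
closure_results = [dup_catch(), dup_catch()]

catcher = DupCatcher(fake_pull_factory())
object_results = [catcher.dup_catch(), catcher.dup_catch()]
```

Both forms keep the "remember everything ever seen" behaviour while letting you create several independent caches, which a single global cannot do.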

If you reorganized your code a bit, you could make this even simpler:

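from more_itertools import unique_everseen  # or copy the recipe from the itertools docs

def pull():
    # Poll sqlPull() forever, yielding one row at a time.
    while True:
        for row in sqlPull():
            yield row

for row in unique_everseen(pull()):
    print(row)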

...or even:

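from itertools import chain  # unique_everseen: see the itertools recipes
for row in unique_everseen(chain.from_iterable(iter(sqlPull, None))):
    # iter(sqlPull, None) keeps calling sqlPull() until it returns None
    print(row)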

See Iterators and the next few tutorial sections, the itertools documentation, and David M. Beazley's presentations to understand what this last version does. But for a novice, you might want to stick with the second version.
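
As a rough guide to what that last version does, here is a self-contained sketch. The unique_everseen below is a simplified rewrite of the recipe in the itertools documentation (without its key argument), and sql_pull() is a hypothetical stub that returns None once it runs out of batches, so the example terminates.

```python
from itertools import chain


def unique_everseen(iterable):
    """Yield each item the first time it appears, preserving order."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item


# Hypothetical data: two overlapping batches, then None to signal "no more".
_batches = iter([
    [("10:00", "aa:bb", -40), ("10:01", "cc:dd", -55)],
    [("10:01", "cc:dd", -55), ("10:02", "ee:ff", -60)],
    None,
])


def sql_pull():
    """Stand-in for sqlPull()."""
    return next(_batches)


# iter(sql_pull, None) calls sql_pull() repeatedly until it returns None;
# chain.from_iterable flattens the resulting batches into one stream of rows.
rows = list(unique_everseen(chain.from_iterable(iter(sql_pull, None))))
```

If the real sqlPull() never returns None, the loop simply keeps polling forever, which is presumably the intent for a monitoring script. Because rows are checked one at a time, this form also drops duplicates that occur within a single batch, which the list-comprehension version does not.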