Python OrderedDict sputtering compared to dict()


This one has me entirely baffled.

asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        #row = collections.OrderedDict([("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
        row = {"computer_name": key_host, "id": index, "hist_item": hist_item}
        asset_hist.append(row)

This code works perfectly with the collections line commented out. However, when I comment out the row = dict line and remove the comment from the collections line, things get very strange. There are about 4 million of these rows being generated and appended to asset_hist.

So, when I use row = dict, the entire loop finishes in about 10 milliseconds, lightning fast. When I use the ordered dictionary, I've waited over 10 minutes and it still didn't finish. Now, I know OrderedDict is supposed to be a little slower than a dict, but it's supposed to be about 10x slower at worst, and by my math it's actually about 100,000 times slower in this function.

I decided to print the index in the lowest loop to see what was happening. Interestingly enough, I noticed a sputtering in console output. The index would print very fast on the screen and then stop for about 3-5 seconds before continuing on.

am_output.asset_history is a dictionary keyed by host, and every value is a list of strings. E.g.

am_output.asset_history = {"host1": ["string1", "string2", ...], "host2": ["string1", "string2", ...], ...}

EDIT: Sputter Analysis with OrderedDict

Total Memory on this VM Server: Only 8GB... need to get more provisioned.

LOOP NUM

184796 (~5 second wait, ~60% memory usage)

634481 (~5 second wait, ~65% memory usage)

1197564 (~5 second wait, ~70% memory usage)

1899247 (~5 second wait, ~75% memory usage)

2777296 (~5 second wait, ~80% memory usage)

3873730 (LONG WAIT... waited 20 minutes and gave up!, 88.3% memory usage, process is still running)

Where the wait happens changes with each run.

EDIT: Ran it again, and this time it stopped on 3873333, close to the spot it stopped before. It stopped after forming the row, while trying to append... I didn't notice this on the last attempt but it was there then too... the problem is with the append line, not the row line... I'm still baffled. Here's the row it produced right before the long stop (added the row to the print statement)... hostname changed to protect the innocent:

3873333: OrderedDict([('computer_name', 'bg-fd5612ea'), ('id', 1), ('hist_item', "sys1 Normalizer (sys1-4): Domain Name cannot be determined from sys1 Name 'bg-fd5612ea'.")])

As your own tests prove, you're running out of memory. Even on CPython 3.6 (where plain dict is actually ordered, though not as a language guarantee yet), OrderedDict has significant memory overhead compared to dict; it's still implemented with a side-band linked list to preserve the order and support easy iteration, reordering with move_to_end, etc. You can tell just by checking with sys.getsizeof (exact results will differ by Python version and build bitwidth, 32 vs. 64 bit):

>>> import sys
>>> from collections import OrderedDict
>>> od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
>>> d = {**od}
>>> sys.getsizeof(od)
464   # On 3.5 x64 it's 512
>>> sys.getsizeof(d)
240   # On 3.5 x64 it's 288

Ignoring the data stored, the overhead for the OrderedDict here is nearly twice that of the plain dict. If you're making 4 million of these items, on my machine that would add overhead of a titch over 850 MB (on both 3.5 and 3.6).
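That figure is just the per-item difference multiplied by the row count; a quick back-of-the-envelope check using the 3.6 x64 sizes above (the 3.5 sizes give the same difference), with a MiB being 2**20 bytes:

>>> (464 - 240) * 4000000 / 2**20   # extra bytes per OrderedDict, times 4 million rows, in MiB
854.4921875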

It's likely the combination of all the other programs on your system, plus your Python program, is exceeding the RAM allocated to your machine, and you're stuck swap thrashing. In particular, whenever asset_hist has to expand for new entries, it's likely needing to page in large parts of it (that got paged out for lack of use), and whenever a cyclic garbage collection run triggers (a full GC occurs roughly every 70,000 allocations and deallocations by default), all the OrderedDicts get paged back in to check if they're still referenced outside of cycles (you could check if the GC runs are the main problem by disabling cyclic GC via gc.disable()).
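If you want to test whether the GC runs are a meaningful part of the pauses, a minimal diagnostic sketch (just wrapping the same loop from the question) might look like this:

import collections
import gc

gc.disable()   # suspend cyclic GC so full collections can't fire mid-loop
try:
    asset_hist = []
    for key_host, val_hist_list in am_output.asset_history.items():
        for index, hist_item in enumerate(val_hist_list):
            row = collections.OrderedDict([("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
            asset_hist.append(row)
finally:
    gc.enable()   # restore normal collection; this loop creates no reference cycles, so deferring GC is safe

If the sputtering shrinks or disappears, the paging triggered by GC scans was a big contributor; either way, it doesn't fix the underlying memory pressure.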

Given your particular use case, I'd strongly recommend avoiding both dict and OrderedDict though. The overhead of even dict, even the cheaper form on Python 3.6, is kind of extreme when you have a set of exactly three fixed keys over and over. Instead, use collections.namedtuple, which is designed for lightweight objects referenceable by either name or index (they act like regular tuples, but also allow accessing each value as a named attribute), which would dramatically reduce the memory cost of your program (and likely speed it up even when memory is not an issue).

For example:

from collections import namedtuple

ComputerInfo = namedtuple('ComputerInfo', ['computer_name', 'id', 'hist_item'])

asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        asset_hist.append(ComputerInfo(key_host, index, hist_item))

Only difference in use is that you replace row['computer_name'] with row.computer_name, or if you need all the values, you can unpack it like a regular tuple, e.g. comphost, idx, hist = row. If you need a true OrderedDict temporarily (don't store them for everything), you can call row._asdict() to get an OrderedDict with the same mapping as the namedtuple, but that's not normally needed. The memory savings are meaningful; on my system, the three element namedtuple drops the per-item overhead to 72 bytes, less than a third that of even the 3.6 dict and less than a sixth of a 3.6 OrderedDict (and three element namedtuple remains 72 bytes on 3.5, where dict/OrderedDict are larger pre-3.6). It may save even more than that; tuples (and namedtuple by extension) are allocated as a single contiguous C struct, while dict and company are at least two allocations (one for the object structure, one or more for the dynamically resizable parts of the structure), each of which may pay allocator overhead and alignment costs.
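For reference, you can check those sizes the same way as above; this is a hypothetical REPL session with made-up values, assuming a 64-bit CPython 3.6 build (exact figures vary by version and build):

>>> import sys
>>> from collections import namedtuple
>>> ComputerInfo = namedtuple('ComputerInfo', ['computer_name', 'id', 'hist_item'])
>>> row = ComputerInfo('host1', 0, 'some history string')
>>> sys.getsizeof(row)       # same layout as a plain three element tuple
72
>>> row.computer_name        # attribute access by name
'host1'
>>> row._asdict()            # an OrderedDict only when one is really needed
OrderedDict([('computer_name', 'host1'), ('id', 0), ('hist_item', 'some history string')])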

Either way, for your four million row scenario, using namedtuple would mean paying (beyond the cost of the values) overhead of around 275 MB total, vs. 915 (3.6) - 1100 (3.5) MB for dict and 1770 (3.6) - 1950 (3.5) MB for OrderedDict. When you're talking about an 8 GB system, shaving 1.5 GB off your overhead is a major improvement.