Detecting duplicates in a recursive CTE

Updated: 2022-10-19 12:54:56

I have a set of dependencies stored in my database. I'm looking to find all the objects that depend on the current one, whether directly or indirectly. Since objects can depend on zero or more other objects, it's perfectly reasonable that object 1 is depended on by object 9 twice (9 depends on 4 and 5, both of which depend on 1). I'd like to get the list of all the objects that depend on the current object without duplication.

This gets more complex if there are loops. Without loops, one could use DISTINCT, though going through long chains more than once only to cull them at the end is still a problem. With loops, however, it becomes important that the RECURSIVE CTE doesn't union with something it has already seen.

So what I have so far looks like this:

WITH RECURSIVE __dependents AS (
  SELECT object, array[object.id] AS seen_objects
  FROM immediate_object_dependents(_objectid) object
  UNION ALL
  SELECT object, d.seen_objects || object.id
  FROM __dependents d
  JOIN immediate_object_dependents((d.object).id) object
    ON object.id <> ALL (d.seen_objects)
) SELECT (object).* FROM __dependents;

(It's in a stored procedure, so I can pass in _objectid)

Unfortunately, this only omits a given object when I've already seen it in the current chain. That would be fine if a recursive CTE ran depth-first, but because it runs breadth-first, it becomes problematic.

Ideally, the solution would be in SQL rather than PL/pgSQL, but either one works.

As an example, I set this up in PostgreSQL:

create table objectdependencies (
  id int,
  dependson int
);

create index on objectdependencies (dependson);

insert into objectdependencies values (1, 2), (1, 4), (2, 3), (2, 4), (3, 4);

And then I tried running this:

with recursive rdeps as (
  select dep
  from objectdependencies dep
  where dep.dependson = 4 -- starting point
  union all
  select dep
  from objectdependencies dep
  join rdeps r
    on (r.dep).id = dep.dependson
) select (dep).id from rdeps;

I'm expecting "1, 2, 3" as output.

However, this somehow goes on forever (which I also don't understand). If I add in a level check (select dep, 0 as level, ... select dep, level + 1, on ... and level < 3), I see that 2 and 3 are repeating. Conversely, if I add a seen check:

with recursive rdeps as (
  select dep, array[id] as seen
  from objectdependencies dep
  where dep.dependson = 4 -- starting point
  union all
  select dep, r.seen || dep.id
  from objectdependencies dep
  join rdeps r
    on (r.dep).id = dep.dependson and dep.id <> ALL (r.seen)
) select (dep).id from rdeps;

then I get 1, 2, 3, 2, 3, and it stops. I could use DISTINCT in the outer select, but that only reasonably works on this data because there is no loop. With a larger dataset and more loops, we will continue to grow the CTE's output only to have the DISTINCT pare it back down. I would like the CTE to simply stop that branch when it's already seen that particular value somewhere else.

Edit: this is not simply about cycle detection (though there can be cycles). It's about uncovering everything referenced by this object, directly and indirectly. So if we have 1->2->3->5->6->7 and 2->4->5, we can start at 1 and go to 2; from there we can go to 3 and 4. Both of those branches lead to 5, but I don't need both to continue: the first one can go on to 5, and the other can simply stop there. Then we go on to 6 and 7. Most cycle detection will find no cycles and return 5, 6, and 7 twice each. Given that I expect most of my production data to have 0-3 immediate references, and most of those to be likewise, it will be very common for there to be multiple branches from one object to another, and going down those branches will be not only redundant but a huge waste of time and resources.
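To make the example in the edit concrete, here is a minimal sketch that loads those two chains into a table and counts how often a plain UNION ALL recursion reaches each object. The table name refs and its columns id and refers_to are made up for illustration; they are not part of the original schema.

```sql
-- Hypothetical table for illustration: object "id" references object "refers_to".
CREATE TEMP TABLE refs (id int, refers_to int);

-- The two chains from the example: 1->2->3->5->6->7 and 2->4->5.
INSERT INTO refs VALUES
  (1, 2), (2, 3), (3, 5), (5, 6), (6, 7),
  (2, 4), (4, 5);

-- Plain UNION ALL recursion: every path is followed to the end,
-- even when it arrives at an object that another path already reached.
WITH RECURSIVE reach(obj) AS (
  SELECT refers_to FROM refs WHERE id = 1   -- starting point
  UNION ALL
  SELECT e.refers_to
  FROM refs e
  JOIN reach r ON e.id = r.obj
)
SELECT obj, count(*) AS times_reached
FROM reach
GROUP BY obj
ORDER BY obj;
```

On this data, 2, 3, and 4 should each be reached once, while 5, 6, and 7 are each reached twice, once through 3 and once through 4. There is no cycle anywhere, so cycle detection alone does not remove the repeated work.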

You can use the connectby function which exists in the tablefunc module.

First you need to enable the module

CREATE EXTENSION tablefunc;

Then you can use the connectby function. Based on the sample table you provided in the question, it would look as follows:

SELECT DISTINCT id
FROM connectby('objectdependencies', 'id', 'dependson', '4', 0)
AS t(id int, dependson int, level int)
WHERE id != 4;

This will return: 1 2 3

Here is an explanation of the parameters from the documentation:

connectby(text relname, text keyid_fld, text parent_keyid_fld
          [, text orderby_fld ], text start_with, int max_depth
          [, text branch_delim ])

  • relname Name of the source relation
  • keyid_fld Name of the key field
  • parent_keyid_fld Name of the parent-key field
  • orderby_fld Name of the field to order siblings by (optional)
  • start_with Key value of the row to start at
  • max_depth Maximum depth to descend to, or zero for unlimited depth
  • branch_delim String to separate keys with in branch output (optional)
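
As a rough sketch of how the optional arguments change the call, the two queries below run against the sample objectdependencies table from the question and assume the tablefunc extension is already installed. The first limits the depth to 1; the second passes branch_delim, which requires an extra branch text column in the output column list.

```sql
-- Direct dependents only: max_depth = 1 stops after the first level.
SELECT id, level
FROM connectby('objectdependencies', 'id', 'dependson', '4', 1)
  AS t(id int, dependson int, level int)
WHERE id != 4;

-- Unlimited depth with branch output: passing branch_delim ('~' here)
-- adds a text column showing the path from the start key to each row.
SELECT id, dependson, level, branch
FROM connectby('objectdependencies', 'id', 'dependson', '4', 0, '~')
  AS t(id int, dependson int, level int, branch text);
```

The branch column makes it visible that connectby walks the data as a tree: an object reachable along several paths shows up once per path, which is why the query above needs DISTINCT.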

Please consult the documentation for more information: https://www.postgresql.org/docs/9.5/static/tablefunc.html
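
As an alternative that stays in plain SQL, and is not part of the answer above, a recursive CTE can use UNION instead of UNION ALL. PostgreSQL then discards any row that duplicates a row already produced, so a branch stops as soon as it reaches an object seen on any other branch. This is a minimal sketch against the sample table from the question; it assumes only the key column is carried through the recursion, since adding a per-row path or level column would make otherwise identical rows distinct again.

```sql
WITH RECURSIVE rdeps(id) AS (
  -- Objects that depend directly on the starting point.
  SELECT id
  FROM objectdependencies
  WHERE dependson = 4
  UNION          -- not UNION ALL: rows already produced are discarded
  -- Objects that depend on anything found so far.
  SELECT d.id
  FROM objectdependencies d
  JOIN rdeps r ON d.dependson = r.id
)
SELECT id FROM rdeps ORDER BY id;
```

On the sample data this returns 1, 2, 3, each once. Because an id that was already produced never re-enters the working table, the recursion also terminates when the data contains a cycle; if the starting object is part of that cycle, it will appear in the result too.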