且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

标记重复记录的 T-SQL 查询

更新时间:2023-01-29 11:32:43

忽略 LineNumber 为空.在这种情况下应该如何处理 IsRepeat?

Ignores LineNumber is null. How should IsRepeat be handled in that case?

它适用于测试数据.它的效率是否足以满足生产量的需求?

It works for test data. Whether it will be efficient enough for production volumes?

在pair上有重复(LineNumber, CreatedOn)的情况下,任意选择一个.(JobId 最小的那个)

In the case of duplicate (LineNumber, CreatedOn) on pairs, arbitrarily choose one. (The one with minimum JobId)

基本思路:

  1. 获取所有的 JobId 对至少相隔 7 天,由行号.
  2. 数一数超过 7 天的行从左侧,向上和包括右侧.(CNT)
  3. 然后我们知道如果 JobId x 不是重复的,那么下一个不是重复的是带有 X 的对左侧,CNT = 1
  4. 使用递归 CTE 从每个 LineNumber 的第一行开始
  5. 递归元素使用带有计数的对来获取下一行.
  6. 最后更新,将所有 IsRepeat 设置为 0(非重复)和 1(其他).

; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
            , LeftJobId
            , RightJobId
            , BeginCreatedOn
            , EndCreatedOn) as
        (select l.LineNumber
            , l.JobId
            , r.JobId
            , dateadd(day, 7, l.CreatedOn)
            , r.CreatedOn
        from Job l
        inner join Job r
            on l.LineNumber = r.LineNumber
            and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
            and l.JobId <> r.JobId)
    -- Count the number of rows within from BeginCreatedOn 
    -- up to and including EndCreatedOn
    -- In the case of CreatedOn = EndCreatedOn, 
    -- include only jobId <= jobid, to handle ties in CreatedOn        
    , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
        (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
        from AllPairsByLineNumberAtLeast7DaysApart ap
        inner join Job j
            on j.LineNumber = ap.LineNumber
            and ap.BeginCreatedOn <= j.createdOn
            and (j.CreatedOn < ap.EndCreatedOn
                or (j.CreatedOn = ap.EndCreatedOn 
                    and j.JobId <= ap.RightJobId))
         group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
    , Step1 (LineNumber, JobId, CreatedOn, RN) as
        (select LineNumber, JobId, CreatedOn
            , row_number() over 
                (partition by LineNumber order by CreatedOn, JobId)
        from Job)
    , Results (JobId, LineNumber, CreatedOn) as    
        -- Start with the first rows.
        (select JobId, LineNumber, CreatedOn
        from Step1
        where RN = 1
        and LineNumber is not null
        -- get the next row
        union all
        select j.JobId, j.LineNumber, j.CreatedOn
        from Results r
        inner join AllPairsCount apc on apc.LeftJobId = r.JobId
        inner join Job j
            on j.JobId = apc.RightJobId
            and apc.CNT = 1)
    update j
    set IsRepeat = case when R.JobId is not null then 0 else 1 end
    from Job j
    left outer join Results r
        on j.JobId = R.JobId
    where j.LineNumber is not null

昨晚我关掉电脑后,我意识到我让事情变得比他们需要的更复杂.一个更直接(在测试数据上,稍微更有效)的查询:

After I turned off the computer last night I realized I had made things more complicated than they needed to be. A more straightforward (and on the test data, slightly more effecient) query:

基本思路:

  1. Generated PotentialStep (FromJobId, ToJobId) 这些是如果 FromJobId不是重复,比 ToJobId 也不是重复.(第一行按 LineNumber 更多距 FromJobId 超过 7 天)
  2. 使用递归 CTE 从每个 LineNumber 的第一个 JobId 开始,然后步进,使用 PontentialSteps,到每个非重复 JobId

; with PotentialSteps (FromJobId, ToJobId) as
    (select FromJobId, ToJobId
    from (select f.JobId as FromJobId
            , t.JobId as ToJobId
            , row_number() over
                 (partition by f.LineNumber order by t.CreatedOn, t.JobId) as RN
        from Job f
        inner join Job t
            on f.LineNumber = t.LineNumber
            and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
        where RN = 1)
, NonRepeats (JobId) as
    (select JobId
    from (select JobId
            , row_number() over
                (partition by LineNumber order by CreatedOn, JobId) as RN
        from Job) Start
    where RN = 1
    union all
    select J.JobId
    from NonRepeats NR
    inner join PotentialSteps PS
        on NR.JobId = PS.FromJobId
    inner join Job J
        on PS.ToJobId = J.JobId)
update J
set IsRepeat = case when NR.JobId is not null then 0 else 1 end
from Job J
left outer join NonRepeats NR
on J.JobId = NR.JobId
where J.LineNumber is not null