且构网


Filter a file by the maximum column value within each group of lines

Updated: 2023-02-08 20:32:46

The script below appears to work correctly, provided the data is ordered so that all lines sharing the same first two name components (for example, gene.100080) are grouped together in the data file. The order of the lines within a group doesn't matter. For each group, it keeps the line or lines with the largest value in the last column.
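If the input is not already grouped, one way to establish the grouping is to sort on the first two dot-separated components first. The following is only a sketch: the helper name group_by_key and the three sample lines are made up for illustration, and it assumes the first column always contains at least two dots, as in the data shown.

```shell
#!/bin/sh

# group_by_key is a hypothetical helper, not part of the original answer.
# It sorts on the first two dot-separated components of each line so that
# lines with the same key end up next to each other.
# LC_ALL=C makes the comparison byte-wise and therefore predictable.
group_by_key() {
    LC_ALL=C sort -t. -k1,1 -k2,2 "$@"
}

# Made-up sample in which the gene.100080 lines are not adjacent.
printf '%s\n' \
    'gene.100080.0.3.p1 x 99.9' \
    'gene.100079.0.0.p1 x 86.7' \
    'gene.100080.0.0.p1 x 99.9' |
    group_by_key
```

With the input grouped this way, the script that follows can be applied unchanged, for example by piping the sorted data into it.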

#!/bin/sh

awk '
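    # Print every line saved for the current group.
    # The extra parameter i makes the loop counter local to the function.
    function dump_memo(    i)
    {
        if (memo_num > 0)
        {
            for (i = 0; i < memo_num; i++)
                print memo_line[i]
        }
    }
    {
        # Group key: the first two dot-separated components of column 1.
        split($1, a, ".")
        key = a[1] "." a[2]
        # Value to maximize: the last column.
        val = $NF
        # print "# " key " = " val " (memo_key = " memo_key ", memo_val = " memo_val ")"
        if (memo_key == key)
        {
            if (memo_val == val)
            {
                # Tie with the current maximum: keep this line as well.
                memo_line[memo_num++] = $0
            }
            else if (memo_val < val)
            {
                # New maximum: drop the saved lines and start again.
                memo_val = val
                memo_num = 0
                memo_line[memo_num++] = $0
            }
        }
        else
        {
            # New group: print the previous group, then start a new one.
            dump_memo()
            memo_num = 0
            memo_line[memo_num++] = $0
            memo_key = key
            memo_val = val
        }
    }
    END { dump_memo() }' "$@"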

When run on the data file shown in the question, the output is:

gene.100079.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   86.7
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   99.9
chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30  0   1   169 4   172 1.3e-75 283.1   76.7

The main difference between this output and what you requested is the sort order. If you need the data in sorted order, pipe the script's output through sort.
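As a rough illustration of that last step, the sketch below sorts abbreviated stand-ins for the four output lines (first and last columns only). The helper name sort_output is an assumption, and a plain sort orders by whole line, which may or may not be the order you are after; add -k options if you need a specific column.

```shell
#!/bin/sh

# sort_output is a hypothetical helper name.
# It sorts whole lines byte-wise; LC_ALL=C keeps the order locale-independent.
sort_output() {
    LC_ALL=C sort "$@"
}

# Abbreviated stand-ins for the four output lines above:
# the first column and the last column, separated by a tab.
printf '%s\t%s\n' \
    gene.100079.0.0.p1 86.7 \
    gene.100080.0.3.p1 99.9 \
    gene.100080.0.0.p1 99.9 \
    chr11_pilon3.g3568.t2 76.7 |
    sort_output
```

In practice the filter script would feed this directly, along the lines of sh filter_max.sh data.txt | sort, where filter_max.sh and data.txt are placeholder names for wherever the script and the data are saved.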