Is there a faster way to truncate a column in Unix?

Updated: 2023-11-25 12:43:58

Your command could be written a little more cleanly (assuming you are rebuilding the record), which may also improve performance:

awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,256) } 1' file > newFile
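As a quick check of the truncation logic, here is a minimal sketch on a one-line sample. The function name `truncate_col4` and the sample data are made up for illustration. It assumes awk's `substr` counts characters from 1, and it relies on the trailing `1` pattern to print each rebuilt record.

```shell
# Hypothetical helper: truncate the 4th tab-separated field to N characters.
truncate_col4() {
    awk -v n="$1" 'BEGIN { FS = OFS = "\t" } { $4 = substr($4, 1, n) } 1'
}

# Sample record with a 5-character 4th field, truncated to 3 characters.
printf 'a\tb\tc\tdefgh\te\n' | truncate_col4 3
```

Assigning to `$4` makes awk rebuild the record using `OFS`, which is why both separators are set to a tab.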

If you have access to a multi-core machine (which you probably do), you can use GNU parallel. You may want to vary the number of cores you use (I've set 4 here) and the block size that's fed to awk (I've set this to two megabytes):

< file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' > newFile
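One way to avoid hard-coding the job count is to derive it from the machine, as in the sketch below. The wrapper name `truncate_parallel` is hypothetical, and two things here are assumptions rather than part of the original command: the `-k` (`--keep-order`) flag, added because GNU parallel normally prints each job's output as soon as it finishes, and the fallback to a single awk process when GNU parallel is not installed.

```shell
# Hypothetical wrapper: truncate field 4 to N characters, in parallel when possible.
truncate_parallel() {
    prog='BEGIN { FS = OFS = "\t" } { $4 = substr($4, 1, n) } 1'
    # Job count: nproc where available, otherwise getconf, otherwise 1.
    njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    # Only use "parallel" if it really is GNU parallel (other tools share the name).
    if parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # -k (--keep-order) emits the blocks in input order.
        parallel -j "$njobs" --pipe --block 2M -k -q awk -v n="$1" "$prog"
    else
        # Fallback: a single awk process.
        awk -v n="$1" "$prog"
    fi
}

# Usage: truncate_parallel 256 < file > newFile
```

Whether `-k` costs anything measurable on a large file is worth testing on your own data; it is included here only so the output lines stay in their original order.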

Here's some testing I did on my system using a 2.7G file with 100 million lines and a block size of 2M.

Plain awk:

time awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' file >/dev/null

Results:

real    1m59.313s
user    1m57.120s
sys     0m2.190s

One core:

time < file parallel -j 1 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' >/dev/null

Results:

real    2m28.270s
user    4m3.070s
sys     0m41.560s

Four cores:

time < file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' >/dev/null

Results:

real    0m54.329s
user    2m41.550s
sys     0m31.460s

Twelve cores:

time < file parallel -j 12 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' >/dev/null

Results:

real    0m36.581s
user    2m24.370s
sys     0m32.230s
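
To put these timings in perspective, the wall-clock speedup relative to plain awk can be computed from the `real` values. This is a small sketch; the helper names `to_seconds` and `speedup` are made up for illustration.

```shell
# Convert a time(1)-style value such as "1m59.313s" to seconds.
to_seconds() {
    # Splitting on "m" and "s" leaves minutes in $1 and seconds in $2.
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Speedup of a run relative to a baseline: baseline time / run time.
speedup() {
    awk -v base="$(to_seconds "$1")" -v t="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / t }'
}

speedup 1m59.313s 2m28.270s   # parallel -j 1 vs plain awk
speedup 1m59.313s 0m54.329s   # parallel -j 4 vs plain awk
speedup 1m59.313s 0m36.581s   # parallel -j 12 vs plain awk
```

On these numbers, 4 jobs are roughly 2.2 times faster than plain awk and 12 jobs roughly 3.3 times faster. A single job through parallel is slower than plain awk, which suggests the splitting itself carries a cost.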