Updated: 2023-11-25 12:43:58
Your command could be written a little more cleanly (assuming you are rebuilding the record), which may also give a small performance gain:
awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,256) } 1' file > newFile
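As a quick sanity check of the truncation itself, here is a minimal sketch on a one-line sample. The function name `truncate_field4` and the sample data are made up for illustration. It starts `substr` at 1 because awk strings are 1-indexed, and it relies on a trailing `1` pattern to print each rebuilt record.

```shell
# Hypothetical helper: truncate field 4 of tab-separated input to $1 characters.
truncate_field4() {
    # substr() starts at 1 (awk strings are 1-indexed);
    # the trailing 1 is an always-true pattern, so every record is printed.
    awk -v max="$1" 'BEGIN { FS = OFS = "\t" } { $4 = substr($4, 1, max) } 1'
}

# Sample record with five tab-separated fields; only field 4 is shortened.
printf 'a\tb\tc\tdefghij\te\n' | truncate_field4 3
# prints: a<TAB>b<TAB>c<TAB>def<TAB>e
```

Assigning to `$4` makes awk rebuild the record with `OFS`, which is why both `FS` and `OFS` are set to a tab.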
If you have access to a multi-core machine (which you probably do), you can use GNU parallel. You may want to vary the number of jobs (I've set 4 here) and the size of the blocks fed to awk (I've set this to two megabytes):
< file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,1,2) } 1' > newFile
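One thing to watch with `--pipe`: as far as I know, blocks are written out as their jobs finish, so output order may differ from input order unless you add `-k` (`--keep-order`). A minimal sketch, assuming order matters to you; the wrapper name `truncate_parallel` is hypothetical, and the fallback to plain awk is my own addition for machines without GNU parallel.

```shell
# Hypothetical wrapper: truncate field 4 to 2 characters, in parallel when possible.
# Usage: truncate_parallel [jobs] < file > newFile
truncate_parallel() {
    prog='BEGIN { FS = OFS = "\t" } { $4 = substr($4, 1, 2) } 1'
    if parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # -k (--keep-order) emits the processed blocks in input order.
        parallel -j "${1:-4}" --pipe --block 2M -k -q awk "$prog"
    else
        # Assumption: no GNU parallel installed, so run a single awk process.
        awk "$prog"
    fi
}

printf '1\t2\t3\tabcdef\t5\n6\t7\t8\tgh\t9\n' | truncate_parallel 2
```

If line order is irrelevant for your data, dropping `-k` lets parallel emit each block as soon as its job finishes.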
Here are some tests I ran on my system, using a 2.7G file with 100 million lines and a block size of 2M. First, plain awk as a baseline:
time awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' file >/dev/null
Results:
real 1m59.313s
user 1m57.120s
sys 0m2.190s
One core:
time < file parallel -j 1 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' >/dev/null
Results:
real 2m28.270s
user 4m3.070s
sys 0m41.560s
Four cores:
time < file parallel -j 4 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' >/dev/null
Results:
real 0m54.329s
user 2m41.550s
sys 0m31.460s
Twelve cores:
time < file parallel -j 12 --pipe --block 2M -q awk 'BEGIN { FS=OFS="\t" } { $4 = substr($4,0,2) }' >/dev/null
Results:
real 0m36.581s
user 2m24.370s
sys 0m32.230s
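To put these numbers in perspective, you can divide the plain-awk `real` time by each parallel run's `real` time. The sketch below does that arithmetic; the helper names `to_seconds` and `speedup` are made up for illustration, and the inputs are just the `real` values reported above.

```shell
# Convert a `time`-style duration such as 1m59.313s into seconds.
to_seconds() {
    LC_ALL=C awk -v t="$1" 'BEGIN {
        split(t, part, "m")          # part[1] = minutes, part[2] = e.g. "59.313s"
        sub(/s$/, "", part[2])       # drop the trailing "s"
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

# Speedup = baseline wall-clock time / measured wall-clock time.
speedup() {
    LC_ALL=C awk -v base="$(to_seconds "$1")" -v run="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / run }'
}

speedup 1m59.313s 2m28.270s   # -j 1
speedup 1m59.313s 0m54.329s   # -j 4
speedup 1m59.313s 0m36.581s   # -j 12
```

This prints 0.80, 2.20 and 3.26. With a single job, the overhead of parallel makes the run slower than plain awk, and going from 4 to 12 jobs gives well under a threefold improvement.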