且构网

Efficiency of multiple string replacement in PowerShell

Updated: 2023-02-03 09:30:07

So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?

Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.

The method:

  1. Construct a hash where the keys are the somethings and the values are the somethingelses.
  2. Join the keys of the hash with the | symbol, and use it as a match group in the regex.
  3. In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group
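
The three steps above can be sketched in Perl roughly as follows. This is only a minimal illustration: the hash contents and the sample line are made up, and quotemeta is an assumption added here so that keys containing regex metacharacters would be matched literally.

```perl
use strict;
use warnings;

# Step 1: hypothetical search => replacement pairs (illustrative only).
my %replacements = (
    'something0' => 'somethingelse0',
    'something1' => 'somethingelse1',
);

# Step 2: join the escaped keys with | to build a single alternation.
my $pattern = join '|', map { quotemeta($_) } keys %replacements;

# Step 3: replace each match with the value stored under the matched key.
my $line = 'something0 and something1 and something2';
(my $result = $line) =~ s/($pattern)/$replacements{$1}/g;

print "$result\n";   # somethingelse0 and somethingelse1 and something2
```

The whole line is scanned once, no matter how many keys the hash holds, which is the point of the approach.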

The problem:

Frustratingly, PowerShell doesn't expose the match variables outside the regex replace call. It doesn't work with the -replace operator and it doesn't work with [regex]::replace.

In Perl, you can do this, for example:

$string =~ s/(1|2|3)/@{[$1 + 5]}/g;

This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".
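
As a runnable check of that example, here is a small sketch that applies the substitution to the sample string. The second form, using the /e modifier, is not from the original answer; it is shown only as another standard Perl way to evaluate an expression in the replacement.

```perl
use strict;
use warnings;

my $string = '1224526123 [2] [6]';

# Form used above: @{[ ... ]} interpolates the value of an expression.
(my $interpolated = $string) =~ s/(1|2|3)/@{[$1 + 5]}/g;

# Alternative form: /e evaluates the replacement as Perl code.
(my $evaluated = $string) =~ s/(1|2|3)/$1 + 5/ge;

print "$interpolated\n";   # 6774576678 [7] [6]
print "$evaluated\n";      # 6774576678 [7] [6]
```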

However, in PowerShell, both of these fail:

$string -replace '(1|2|3)',"$($1 + 5)"

[regex]::replace($string,'(1|2|3)',"$($1 + 5)")

In both cases, $1 evaluates to null, and the expression evaluates to plain old 5. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can quote the $ before the number in a double-quoted string, so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.

The solution:

[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]

If using another language is acceptable to you, the following Perl script works like a charm:

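$filePath = $ARGV[0]; # Or hard-code it or whatever
# Open the input and output files, stopping with a message if either fails
open INPUT, '<', $filePath or die "Cannot open $filePath: $!";
open OUTPUT, '>', 'C:\log.txt' or die "Cannot open C:\\log.txt: $!";
# Search strings (keys) and their replacements (values)
%replacements = (
  'something0' => 'somethingelse0',
  'something1' => 'somethingelse1',
  'something2' => 'somethingelse2',
  'something3' => 'somethingelse3',
  'something4' => 'somethingelse4',
  'something5' => 'somethingelse5',
  'X:\Group_14\DACU' => '\\DACU$',
  '.*[^xyz]' => 'oO{xyz}',
  'moresomethings' => 'moresomethingelses'
);
# Quote the metacharacters in each key and double the backslashes in each value
foreach (keys %replacements) {
  push @strings, qr/\Q$_\E/;
  $replacements{$_} =~ s/\\/\\\\/g;
}
# Combine the quoted keys into a single alternation
$pattern = join '|', @strings;
# Process the input one line at a time
while (<INPUT>) {
  s/($pattern)/$replacements{$1}/g;
  print OUTPUT;
}
close INPUT;
close OUTPUT;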

It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:

  • The foreach loop goes through all the elements of the hash and creates an array called @strings that contains the keys of the %replacements hash, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
  • Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
  • The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
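
To illustrate why the keys are quoted with \Q and \E, here is a small sketch that uses the '.*[^xyz]' key from the script both raw and quoted. The sample line in $text is hypothetical and chosen only to make the difference visible.

```perl
use strict;
use warnings;

my $key  = '.*[^xyz]';            # a key full of regex metacharacters
my $text = 'path .*[^xyz] end';   # hypothetical input line

# Unquoted, the key is interpreted as a pattern and matches far too much.
(my $raw = $text) =~ s/$key/oO{xyz}/;

# Quoted with \Q...\E, only the literal characters .*[^xyz] are matched.
my $quoted_re = qr/\Q$key\E/;
(my $quoted = $text) =~ s/$quoted_re/oO{xyz}/;

print "$raw\n";
print "$quoted\n";   # path oO{xyz} end
```

Without the quoting, a single key like this one could silently rewrite entire lines, so the extra step matters whenever the search strings come from real paths or patterns.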

BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.

  • while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
  • I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it creates a literal array of one element containing the expression. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
  • The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
  • I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
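
The point about @{[ ]} can be seen in a short sketch. The hash entry and variable names below are made up for illustration; the idea is only to contrast direct interpolation of a scalar with the workaround needed for an expression.

```perl
use strict;
use warnings;

my %replacements = ('something0' => 'somethingelse0');  # hypothetical entry
my $key = 'something0';

# A hash element is a scalar, so it interpolates directly.
my $direct = "value: $replacements{$key}";

# @{[ ... ]} builds an anonymous array and dereferences it; this is only
# needed when the thing being interpolated is an expression.
my $wrapped  = "value: @{[ $replacements{$key} ]}";
my $computed = "length: @{[ length($replacements{$key}) ]}";

print "$direct\n";     # value: somethingelse0
print "$wrapped\n";    # value: somethingelse0
print "$computed\n";   # length: 14
```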