且构网

Grep for a string between two other strings used as delimiters

Updated: 2023-02-22 12:12:28

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contain that class, so a grep returns every single page.

So, how do I grep for content?

EDIT: I want to know whether a page has list-unstyled between <main> and </main>

So do I use a regular expression for that grep? Or do I need to use PowerShell to have more functionality?

I have grep and PowerShell at my disposal, but I could use portable software if that is my only option.

Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.

EDIT: Progress

I now have this in PowerShell

$files = get-childitem -recurse -path w:\test\york\ -Filter *.html 
foreach ($file in $files)
{
$htmlfile=[System.IO.File]::ReadAllText($file.fullName)
$regex="(?m)<main([\w\W]*)</main>"
if ($htmlfile -match $regex) { 
    $middle=$matches[1] 
    [regex]::Matches($middle,"list-unstyled")
    Write-Host $file.fullName has matches in the middle:
}
}

I run it with this command: .\FindStr.ps1 | Export-Csv C:\Tools\text.csv

It outputs the filename and path with the string to the console, but does not add anything to the CSV. How can I get that added in?

What Ansgar Wiechers' answer says is good advice: don't string-search HTML files. I don't have a problem with it, but it is worth noting that not all HTML files are the same and regex searches can produce flawed results. If tools exist that are aware of the file's content structure, you should use them.

I would take a simple approach: report every HTML file in a given directory that has enough occurrences of the text list-unstyled. You expect there to be 2 per page? Then any file with more than that has at least one occurrence in the content. I would have written a more complicated regex solution, but since you want the line numbers as well, I came up with this compromise.

$pattern = "list-unstyled"

# Select-String emits one match object per matching line, carrying Path and LineNumber.
Get-ChildItem C:\temp -Recurse -Filter *.html |
    Select-String $pattern |
    Group-Object Path |
    Where-Object { $_.Count -gt 2 } |
    ForEach-Object {
        # Each group holds all matching lines of one file.
        # [pscustomobject] keeps the properties in the order written (PowerShell 3.0+).
        [pscustomobject]@{
            File         = $_.Group | Select-Object -First 1 -ExpandProperty Path
            PatternFound = ($_.Group | Select-Object -ExpandProperty LineNumber) -join ";"
        }
    }

Select-String is a grep-like tool that searches files for strings. It reports the line number of each match in the file, which is why we are using it here.

You should get output that looks like this on your PowerShell console.

File                 PatternFound
----                 ------------
C:\temp\content.html 4;11;54

Here 4, 11 and 54 are the lines where the text was found. The code filters out files with fewer than 3 matching lines, so a page that has the class only once in the header and once in the footer is excluded.
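
If counting occurrences is too coarse, the same Select-String idea can be narrowed to the <main> section the question asks about. The following is only a rough sketch, not part of the original answer: the function name Get-MainSectionMatch and its parameters are made up for illustration, and it assumes each page has a single <main> element whose opening and closing tags sit on their own lines. It works on line numbers, so a match that shares a line with either tag is counted as inside, and minified HTML will not give useful results.

```powershell
# Hypothetical helper (not from the original answer): list the line numbers of
# $Pattern that fall between the <main ...> line and the </main> line of each file.
function Get-MainSectionMatch {
    param(
        [Parameter(Mandatory)] [string] $Path,   # directory to search
        [string] $Pattern = 'list-unstyled',     # literal text to look for
        [string] $Filter  = '*.html'
    )

    Get-ChildItem -Path $Path -Recurse -Filter $Filter | ForEach-Object {
        $file = $_.FullName

        # First opening and first closing tag; assumes one <main> element per page.
        $open  = Select-String -LiteralPath $file -Pattern '<main\b' | Select-Object -First 1
        $close = Select-String -LiteralPath $file -Pattern '</main>' | Select-Object -First 1
        if (-not $open -or -not $close) { return }   # no <main> section: skip this file

        # Keep only the hits whose line number lies inside the section (inclusive).
        $hits = Select-String -LiteralPath $file -Pattern $Pattern -SimpleMatch |
            Where-Object { $_.LineNumber -ge $open.LineNumber -and
                           $_.LineNumber -le $close.LineNumber }

        if ($hits) {
            # Emit an object instead of calling Write-Host so the result
            # travels down the pipeline.
            [pscustomobject]@{
                File         = $file
                PatternFound = ($hits | ForEach-Object { $_.LineNumber }) -join ';'
            }
        }
    }
}
```

Because the function emits objects rather than calling Write-Host, which writes to the console and not to the pipeline, the result can be exported directly, for example: Get-MainSectionMatch -Path C:\temp | Export-Csv C:\Tools\text.csv -NoTypeInformation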