Updated: 2023-01-25 19:15:27
I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.
Example String:
<p>This is an example</br>of various <span style="color: #445444">html content</span>
I have tried using:
$string -replace '\<([^\)]+)\>',''
It works with simple examples, but with ones such as the above it captures the whole string.
Any suggestions on what's the best way to achieve this?
Thanks in advance
For a pure regex, it should be as easy as <[^>]+>:
$string -replace '<[^>]+>',''
Note that this could fail with certain HTML comments or the contents of <pre> tags.
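To make the difference concrete: the pattern in the question uses [^\)]+, which excludes ")" rather than ">", so the greedy match runs from the first "<" to the last ">" and swallows the whole string. Below is a minimal sketch of the corrected pattern on the question's sample, plus one way the comment caveat can show up. The $comment string and the comment-stripping step are assumptions added for illustration, not a complete fix.

```powershell
# Sample from the question, plus a hypothetical string with a comment containing '>'
$sample  = '<p>This is an example</br>of various <span style="color: #445444">html content</span>'
$comment = 'before<!-- a > b -->after'

# [^>]+ stops at the first '>', so each tag is matched on its own
$plain = $sample -replace '<[^>]+>', ''

# The comment is cut at its first '>', so the rest of it leaks into the output
$leaky = $comment -replace '<[^>]+>', ''

# Possible workaround: drop comments first ((?s) lets '.' span newlines), then tags
$clean = $comment -replace '(?s)<!--.*?-->', '' -replace '<[^>]+>', ''

$plain   # This is an exampleof various html content
$leaky   # before b -->after
$clean   # beforeafter
```

Note that "example" and "of" end up fused because the </br> between them is replaced with nothing; replacing with a single space instead may suit some inputs better.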
Instead, you could use the HTML Agility Pack, which is designed for use in .NET code, and I've used it successfully in PowerShell before:
Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText
HTML Agility Pack works well with non-perfect HTML.