PowerShell: removing HTML tags from string content

Updated: 2023-01-25 19:15:27

I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.

Example String:

<p>This is an example</br>of various <span style="color: #445444">html content</span>

I have tried using:

$string -replace '\<([^\)]+)\>',''

It works with simple examples, but on strings such as the one above it matches the whole string.

Any suggestions on what's the best way to achieve this?

Thanks in advance

For a pure regex, it should be as easy as <[^>]+>:

$string -replace '<[^>]+>',''
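As a quick check, the sketch below runs both patterns against the example string from the question. The variable names `$greedy` and `$stripped` are only illustrative. The class `[^\)]` in the original attempt excludes only `)`, so the match runs greedily from the first `<` to the last `>`, whereas `[^>]` cannot cross a `>` and therefore stops at the end of each tag.

```powershell
# Example string taken from the question.
$string = '<p>This is an example</br>of various <span style="color: #445444">html content</span>'

# Original attempt: [^\)]+ matches anything except ')', so one match spans the whole string.
$greedy = $string -replace '\<([^\)]+)\>', ''

# Suggested pattern: [^>]+ cannot cross a '>', so each tag is matched separately.
$stripped = $string -replace '<[^>]+>', ''

$greedy      # empty string
$stripped    # This is an exampleof various html content
```

Note that words run together where a tag was the only separator ("exampleof"); replacing with a space instead of an empty string, then collapsing repeated whitespace, is one possible workaround.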

Note that this could fail with certain HTML comments or the contents of <pre> tags.
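To make those failure cases concrete, here is a small sketch using two made-up inputs (both strings are hypothetical, chosen only to trigger the problem): a comment that contains `>` before its closing `-->`, and a `<pre>` block whose text contains literal `<` and `>` characters.

```powershell
# Hypothetical input: the comment contains '>' before its closing '-->'.
$comment = '<!-- a > b --><p>text</p>'
$commentResult = $comment -replace '<[^>]+>', ''

# Hypothetical input: <pre> content with unescaped comparison operators.
$pre = '<pre>if (a < b && c > d)</pre>'
$preResult = $pre -replace '<[^>]+>', ''

$commentResult   # " b -->text"  (the tail of the comment leaks into the output)
$preResult       # "if (a  d)"   (the text between '<' and '>' is removed as if it were a tag)
```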

Instead, you could use the HTML Agility Pack, which is designed for use in .NET code; I have used it successfully in PowerShell before:

Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'

$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText

HTML Agility Pack works well with non-perfect HTML.
