Using regular expressions to parse content out of HTML?

Updated: 2023-02-10 20:12:42

How can I use regex to find everything except for data within div with a specific style? e.g.

<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>

I want the regex to match everything except for the data. The best way I can see is to just remove the HTML markup and combine the files afterwards with VB (I already have the VB code).

I'm using regex because I need to extract the data from several hundred files.

Your suggested method is probably not a good way to do this. If:

you have access to grep
your version of grep supports perl-compatible regex (PCRE)
this style of div only wraps your data, not other elements
the 'data' div does not contain other divs

Then you can use:

(?s)<div style="float:left; padding-top:5px;">.*?</div>

The important parts of this are:

(?s) which activates DOTALL, which means that . will match newlines
.*? which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.
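
As a rough illustration of why the reluctant quantifier matters, the sketch below runs the pattern above and a greedy variant against sample markup adapted from the question. It assumes GNU grep built with PCRE support; the variable names (html, lazy, greedy) and the reordered sample are made up for the demonstration.

```shell
# Sample input adapted from the question: the data div comes first,
# followed by a divider div that should NOT be part of the match.
html='<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>'

lazy='(?s)<div style="float:left; padding-top:5px;">.*?</div>'
greedy='(?s)<div style="float:left; padding-top:5px;">.*</div>'

# tr turns the NUL terminators written by grep -z back into newlines.
lazy_out=$(printf '%s\n' "$html" | grep -Pzo "$lazy" | tr '\0' '\n')
greedy_out=$(printf '%s\n' "$html" | grep -Pzo "$greedy" | tr '\0' '\n')

echo "--- lazy (.*?) ---"
echo "$lazy_out"
echo "--- greedy (.*) ---"
echo "$greedy_out"
```

The lazy match ends at the first </div>, while the greedy one runs on to the last </div> in the input and so drags the divider div along with it.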

To use this, you'll need to activate a few grep options:

grep -Pzo "$PATTERN" file

For these:

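-P interprets the pattern as a Perl-compatible regular expression (PCRE)
-z makes grep split its input on NUL bytes instead of newlines, so a file that contains no NUL bytes is treated as a single line and the pattern can match across line breaks (each match is then also written with a trailing NUL instead of a newline)
-o prints only the matching parts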
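
To see what -z contributes, here is a small sketch that counts matches with and without it. It assumes GNU grep with -P support; the temporary file and the variable names (tmp, without_z, with_z) are illustrative only.

```shell
# Write a sample file whose data div spans three lines.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
EOF

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# Without -z grep tests each line on its own, so a pattern that has to
# span several lines never matches. grep exits non-zero when nothing
# matches, hence the "|| true".
without_z=$(grep -Pc "$PATTERN" "$tmp" || true)

# With -z the whole file is one record, so the div matches once.
with_z=$(grep -Pzc "$PATTERN" "$tmp")

rm -f "$tmp"
echo "matching records without -z: $without_z, with -z: $with_z"
```

Note that the pattern is passed in double quotes: it contains spaces and quote characters, so an unquoted $PATTERN would be split into several arguments by the shell.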

After this you'll need to strip off the divs. sed is a good tool for this.

sed 's|</\?div[^>]*>||g'
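
A quick way to check the sed step in isolation is to feed it one matched block, as in the sketch below. The \? in the expression is a GNU sed extension to basic regular expressions, so GNU sed is assumed here. The two extra expressions that trim leading whitespace and drop the now-empty lines are an optional addition for the demonstration, not part of the command above.

```shell
# One block as grep -o would emit it (sample data from the question).
matched='<div style="float:left; padding-top:5px;">
    Data to keep
</div>'

stripped=$(printf '%s\n' "$matched" \
  | sed -e 's|</\?div[^>]*>||g' \
        -e 's/^[[:space:]]*//' \
        -e '/^$/d')

echo "$stripped"
```

With only the first expression, the opening and closing tag lines would remain in the output as empty lines and the data line would keep its indentation.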

If you put all of your files in one directory you can do the joining at the same time (-h stops grep from prefixing each match with its file name when it is given several files):

grep -Pzoh "$PATTERN" *.html | sed 's|</\?div[^>]*>||g' > out.html
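
The whole pipeline can be tried end to end on throwaway files, as in the following sketch. It assumes GNU grep and GNU sed; the directory, the file names (page1.html, page2.html, out.txt) and the "Data 1" / "Data 2" contents are invented for the test. Two things differ from the one-liner above and are assumptions of this sketch: tr converts the NUL terminators that grep -z writes into newlines, and the result goes to a file that the *.html glob cannot pick up on a later run.

```shell
# Build two sample pages in a scratch directory.
dir=$(mktemp -d)
for n in 1 2; do
  cat > "$dir/page$n.html" <<EOF
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data $n
</div>
EOF
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# -h suppresses the "filename:" prefix grep adds for multiple files.
grep -Pzoh "$PATTERN" "$dir"/*.html \
  | tr '\0' '\n' \
  | sed -e 's|</\?div[^>]*>||g' -e 's/^[[:space:]]*//' -e '/^$/d' \
  > "$dir/out.txt"

result=$(cat "$dir/out.txt")
rm -rf "$dir"
echo "$result"
```

Each input file contributes one line of extracted data, in the order the shell expands the glob.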
