且构网

Use regex to parse content out of HTML?

Updated: 2023-02-10 20:12:42

How can I use regex to find everything except for data within div with a specific style? e.g.

<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>

I want the regex to match everything except for the data. The best way I can see is to just remove the HTML markup and combine the files afterwards with VB (I already have the VB code for that).

I'm using regex because I need to extract the data from several hundred files.

Your suggested method is probably not a good way to do this. If:

  • you have access to grep
  • your version of grep supports perl-compatible regex (PCRE)
  • this style of div only wraps your data, not other elements
  • the 'data' div does not contain other divs

Then you can use:

(?s)<div style="float:left; padding-top:5px;">.*?</div>

The important parts of this are:

  • (?s) which activates DOTALL, which means that . will match newlines
  • .*? which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.
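
To see what the reluctant quantifier buys you, here is a small sketch, assuming GNU grep built with PCRE support. The sample file contents and the variable names (sample, lazy, greedy) are made up for illustration, and it relies on the same grep -Pzo options this answer uses. With two 'data' divs in the file, the reluctant pattern should find each div separately, while the greedy variant swallows everything from the first opening tag to the last closing tag.

```shell
# Sample input with two 'data' divs (hypothetical content).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div style="float:left; padding-top:5px;">
    First
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Second
</div>
EOF

lazy='(?s)<div style="float:left; padding-top:5px;">.*?</div>'
greedy='(?s)<div style="float:left; padding-top:5px;">.*</div>'

# Under -z every printed match ends in a NUL byte,
# so counting NUL bytes counts the matches.
lazy_count=$(grep -Pzo "$lazy" "$sample" | tr -cd '\0' | wc -c)
greedy_count=$(grep -Pzo "$greedy" "$sample" | tr -cd '\0' | wc -c)

echo "lazy matches: $lazy_count, greedy matches: $greedy_count"
rm -f "$sample"
```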

To use this, put the pattern in a shell variable (PATTERN here), quote it so the shell does not split it on the spaces, and activate a few grep options:

grep -Pzo "$PATTERN" file

For these:

  • -P activates PCRE
  • -z makes grep use NUL instead of \n as the line terminator, so a file with no NUL bytes is treated as a single line (matches printed with -o are NUL-terminated too)
  • -o prints only the matching parts
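
As a rough check of why -z matters, the sketch below runs the same pattern with and without it. It assumes GNU grep with PCRE support; the sample file and the variable names without_z and with_z are invented for the example, and the tr step that turns the trailing NUL into a newline is my own addition for readable output.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
EOF

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# Without -z grep works line by line, so a div spanning several lines never matches.
without_z=$(grep -Po "$PATTERN" "$sample" || true)

# With -z the whole file is one record; tr turns the trailing NUL into a newline.
with_z=$(grep -Pzo "$PATTERN" "$sample" | tr '\0' '\n')

printf 'without -z: [%s]\n' "$without_z"
printf 'with -z:\n%s\n' "$with_z"
rm -f "$sample"
```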

After this you'll need to strip off the divs. sed is a good tool for this.

sed 's|</\?div[^>]*>||g'
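
A quick way to try the sed step on its own, assuming GNU sed (the \? in a basic regex is a GNU extension). The matched text below is a hand-written stand-in for what grep would emit, and the second sed that drops the blank lines left behind by the removed tags is an extra step of mine, not something the command above does.

```shell
# A matched block as grep -Pzo might emit it (hypothetical sample).
matched='<div style="float:left; padding-top:5px;">
    Data to keep
</div>'

# Delete opening and closing div tags; the tag lines become empty lines.
stripped=$(printf '%s\n' "$matched" | sed 's|</\?div[^>]*>||g')

# Optional extra step: drop the now-empty lines.
clean=$(printf '%s\n' "$stripped" | sed '/^[[:space:]]*$/d')

printf '%s\n' "$clean"
```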

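If you put all of your files in one directory you can do the joining at the same time. With more than one input file grep prefixes each match with the file name, so add -h to suppress that, and remove out.html before re-running, since *.html would otherwise pick it up as input:

grep -Pzoh "$PATTERN" *.html | sed 's|</\?div[^>]*>||g' > out.html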
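
Putting the pieces together, here is an end-to-end sketch over a throwaway directory, assuming GNU grep (with PCRE) and GNU sed. The page1.html/page2.html files, the workdir and combined names, and the combined.txt output file are all hypothetical; the -h flag, the tr step and the blank-line cleanup are my additions to keep the output tidy, and the output is written under a name that *.html will not match.

```shell
workdir=$(mktemp -d)

# Two hypothetical pages, each with one divider div and one 'data' div.
for n in 1 2; do
cat > "$workdir/page$n.html" <<EOF
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data $n
</div>
EOF
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# -h drops the file-name prefix; tr converts the NUL separators from -z;
# the last sed removes the blank lines left where the tags were.
grep -Pzoh "$PATTERN" "$workdir"/*.html \
  | tr '\0' '\n' \
  | sed 's|</\?div[^>]*>||g' \
  | sed '/^[[:space:]]*$/d' > "$workdir/combined.txt"

combined=$(cat "$workdir/combined.txt")
printf '%s\n' "$combined"
rm -rf "$workdir"
```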