Updated: 2023-02-10 20:12:42
How can I use regex to find everything except for data within div with a specific style? e.g.
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
I want regex to match everything except for the data. The best way I can see is to just remove the html markup and combine the files afterwards with vb (I already have the code for vb.)
I'm using regex because I need to extract the data from several hundred files.
Your suggested method is probably not a good way to do this. If you have a grep with PCRE support, the div only wraps your data and not other elements, and the div does not contain other divs, then you can use:
(?s)<div style="float:left; padding-top:5px;">.*?</div>
The important parts of this are:
(?s), which activates DOTALL, which means that . will match newlines.
.*?, which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.
To use this, you'll need to activate a few grep options:
grep -Pzo "$PATTERN" file
For these:
-P activates PCRE.
-z uses NUL instead of \n as the line terminator, so grep will treat the entire file as a single line.
-o prints only the matching parts.
After this you'll need to strip off the divs. sed is a good tool for this.
sed 's|</\?div[^>]*>||g'
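As a rough end-to-end sketch of the grep and sed steps on the markup from the question: this assumes GNU grep built with PCRE support and GNU sed. The file name sample.html, the temporary directory, the tr step and the blank-line cleanup are illustrative additions, not part of the original commands.

```shell
#!/bin/sh
# Sketch only: assumes GNU grep (with -P support) and GNU sed.
# sample.html is a hypothetical file holding the markup from the question.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
<img src="../Style/BreadCrumbs/Divider.png">
</div>
EOF

# Single quotes keep the shell from touching the double quotes in the pattern.
PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# With -z, grep terminates each match with NUL, so tr turns those back into
# newlines before sed strips the div tags. The second sed expression drops the
# empty lines left behind (an assumption: the data has no meaningful blank lines).
extracted=$(grep -Pzo "$PATTERN" "$workdir/sample.html" \
  | tr '\0' '\n' \
  | sed -e 's|</\?div[^>]*>||g' -e '/^$/d')

printf '%s\n' "$extracted"
```

With the sample above this should print only the line Data to keep. The two divider divs are skipped because their style attribute does not match the pattern.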
If you put all of your files in one directory you can do the joining at the same time (-h stops grep from prefixing each match with its file name):
grep -Pzoh "$PATTERN" *.html | sed 's|</\?div[^>]*>||g' > out.html
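The multi-file join could look something like the following sketch. It again assumes GNU grep with PCRE support and GNU sed. The file names a.html and b.html, their contents, and the output location are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU grep (with -P support) and GNU sed.
# a.html and b.html are hypothetical input files.
workdir=$(mktemp -d)
for name in a b; do
  printf '<div style="float:left; padding-top:5px;">\nData from %s\n</div>\n' "$name" \
    > "$workdir/$name.html"
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# -h suppresses the "file name:" prefix grep prints when given several files.
# tr converts the NUL terminators that -z produces into newlines; the second
# sed expression removes the empty lines left after the tags are stripped.
joined=$(grep -Pzoh "$PATTERN" "$workdir"/*.html \
  | tr '\0' '\n' \
  | sed -e 's|</\?div[^>]*>||g' -e '/^$/d')

# Write the result outside the directory matched by *.html, so that a rerun
# does not feed the previous output back into grep.
outfile=$(mktemp)
printf '%s\n' "$joined" > "$outfile"
cat "$outfile"
```

The shell expands the glob in sorted order, so the data from a.html comes before the data from b.html. Keeping the output file out of the input directory is a design choice in this sketch; writing out.html next to the inputs works on a first run but can be picked up by the *.html glob later.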