且构网


PHP: Regex replace while ignoring content between HTML tags

Updated: 2023-02-17 23:22:49

I haven't tested the logic used on this page (http://www.phpro.org/examples/Get-Text-Between-Tags.html), but I can confirm the point made at the top of that page in big bold letters: you shouldn't do what you're trying to do with regex.

HTML is not uniform, and if you use regular expressions to handle the content of tags in any real-world situation, edge cases will always bite you in the rear. So unless your markup is extremely simple, uniform, 100% accurate, and contains only HTML (no CSS, JavaScript, or garbage), your best bet is a DOM parser library.

Admittedly, many DOM parser libraries have problems too, but you'll be miles ahead of their regex counterparts. The best way to get the text content of tags is to render the HTML in a browser and access the innerText property of the given DOM node (or have a human copy and paste the contents out manually), but that isn't always an option :D
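
As a rough illustration of the DOM parser route, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). It assumes that "ignoring content between HTML tags" means leaving tag names and attributes untouched while replacing in the visible text. The function name `replaceInTextNodes` and the wrapper id `fragment-root` are made up for this example, and the usual caveats about malformed markup still apply.

```php
<?php
// Hypothetical helper (not from the page linked above): runs a regex
// replacement on text nodes only, so tag names and attribute values
// are never seen by the pattern.
function replaceInTextNodes(string $html, string $pattern, string $replacement): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings instead of printing them; real-world markup
    // is rarely valid.
    $previous = libxml_use_internal_errors(true);

    // Wrap the fragment in a full document with a known container element
    // (the id is an assumption of this sketch) and declare UTF-8.
    $doc->loadHTML(
        '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body><div id="fragment-root">' . $html . '</div></body></html>'
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="fragment-root"]')->item(0);
    if ($root === null) {
        return $html; // parsing failed; return the input unchanged
    }

    // Select text nodes, skipping anything inside <script> or <style>.
    $textNodes = $xpath->query(
        './/text()[not(ancestor::script) and not(ancestor::style)]',
        $root
    );

    foreach ($textNodes as $node) {
        $replaced = preg_replace($pattern, $replacement, $node->data);
        if ($replaced !== null) { // null means the regex itself failed
            $node->data = $replaced;
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the class and href values keep "foo"; only the visible text changes.
$input = '<p class="foo">foo <a href="/foo">foo bar</a></p>';
echo replaceInTextNodes($input, '/\bfoo\b/', 'baz'), PHP_EOL;
```

The regex is still doing the replacing here, but it only ever sees plain text that the parser has already separated from the markup, which is what sidesteps most of the edge cases described above.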