How to remove images and text from an RSS feed description tag?

Updated: 2023-11-27 20:29:16

Code: (Demo)

$xml = '<![CDATA[
    <p>Við vorum að fá inn til okkar forfallaholl í Laugardalsá á best tíma. Annarsvegar er um að ræða hollið 18-21. júlí og síðan hollið 24-27. júlí. Bæði eru hollin á frábærum tíma í ánn. Þó svo um 3ja daga holl sé að ræða, er að hægt að skoða staka daga eða 1 1/2 eða 2
    </p>
    <p>The post <a rel="nofollow" href="https://a.com/post-title/">Laugardalsá &#8211; forfallaholl á besta tíma</a> appeared first on <a rel="nofollow" href="https://a.com">a.com</a>.</p>
]]>';

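// Patterns are applied in order; each match is replaced with an empty string.
$finds = [
    // 1) the trailing "The post ... appeared first on ..." paragraph
    '~<p>The post <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a> appeared first on <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a>\.</p>~iu',
    // 2) the opening CDATA wrapper
    '~^<!\[CDATA\[~',
    // 3) the closing CDATA wrapper
    '~\]\]>$~'
];

// Remove the unwanted substrings, strip the remaining tags, then trim whitespace.
var_export(trim(strip_tags(preg_replace($finds, '', $xml))));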

Output:

'Við vorum að fá inn til okkar forfallaholl í Laugardalsá á best tíma. Annarsvegar er um að ræða hollið 18-21. júlí og síðan hollið 24-27. júlí. Bæði eru hollin á frábærum tíma í ánn. Þó svo um 3ja daga holl sé að ræða, er að hægt að skoða staka daga eða 1 1/2 eða 2'

I expect this to handle your data largely in the way that you require. The first regex pattern is certainly the hairiest one. You will need to adjust the [a-z]+\.com part to suit your needs, potentially using something like (?:test\.com|example\.net|sample\.co\.uk). Until you get it "just right", feed some input data into regex101 and keep tweaking your pattern until it works.
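As a rough sketch of that adjustment, the alternation can be built from a list of hosts with preg_quote(), so the dots are escaped for you. The $hosts list, the $hostGroup name, and the sample $html are made up for illustration; substitute the domains from your own feed.

```php
<?php
// Hypothetical list of hosts whose "The post ... appeared first on ..." footer should go.
$hosts = ['test.com', 'example.net', 'sample.co.uk'];

// Escape each host for use inside a ~-delimited pattern, then join into (?:a|b|c).
$hostGroup = '(?:' . implode('|', array_map(function ($host) {
    return preg_quote($host, '~');
}, $hosts)) . ')';

$pattern = '~<p>The post <a rel="nofollow" href="https?://' . $hostGroup . '[^"]*">.*?</a>'
    . ' appeared first on <a rel="nofollow" href="https?://' . $hostGroup . '[^"]*">.*?</a>\.</p>~iu';

// Made-up sample input.
$html = '<p>Body text.</p>'
    . '<p>The post <a rel="nofollow" href="https://example.net/post-title/">Title</a>'
    . ' appeared first on <a rel="nofollow" href="https://example.net">example.net</a>.</p>';

var_export(preg_replace($pattern, '', $html));
```

Building the group from an array keeps the host list in one place, which may be easier to maintain than editing the pattern by hand each time a domain is added.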

The 2nd and 3rd patterns just clear away the CDATA wrapper. While the 2nd one is not strictly necessary because strip_tags() will clean that substring away anyway, the 3rd is critical because strip_tags() will otherwise leave a dangling ]]>.
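To see the difference for yourself, here is a small sketch on a made-up, minimal CDATA fragment (not the real feed data). It compares strip_tags() on its own with strip_tags() after the 3rd pattern has removed the closing wrapper.

```php
<?php
// Made-up, minimal CDATA-wrapped fragment.
$xml = '<![CDATA[<p>Hello</p>]]>';

// strip_tags() alone: the opening wrapper is swallowed, the closing ]]> is left behind.
$stripped = strip_tags($xml);

// Removing the closing wrapper first, as the 3rd pattern does, avoids the leftover.
$cleaned = strip_tags(preg_replace('~\]\]>$~', '', $xml));

var_export([$stripped, $cleaned]);
```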

The first pattern is case-insensitive (i) and Unicode-tolerant (u) for best results.
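If the two flags are unfamiliar, this short sketch shows what each one changes in general. The test strings are made up, and whether your feed actually needs either flag depends on your data.

```php
<?php
// i: letter case is ignored.
$withoutI = preg_match('~the post~', 'The Post');
$withI    = preg_match('~the post~i', 'The Post');

// u: pattern and subject are treated as UTF-8, so . matches a whole character.
$char = "\xC3\xA1"; // the letter "á" encoded as two UTF-8 bytes
$withoutU = preg_match('~^.$~', $char);  // without u, . matches a single byte
$withU    = preg_match('~^.$~u', $char); // with u, . matches the full character

var_export([$withoutI, $withI, $withoutU, $withU]);
```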

^ and $ are the start-of-string and end-of-string anchors. If they are not suitable for your actual data, they can be removed. These steps are just attempts to "mop up" any unwanted residual substrings. The trim() call is certainly something that I would include so that the stored data is as clean as it can be.
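One case where the anchors may not suit your data is a wrapper padded with whitespace. The sketch below uses a made-up input to compare the anchored patterns with unanchored versions; note that the unanchored patterns would remove the wrapper text wherever it occurs in the string.

```php
<?php
// Made-up input: the wrapper is padded with spaces, so ^ and $ no longer line up with it.
$xml = '  <![CDATA[Hello]]>  ';

// Anchored patterns (as in the answer) find nothing to remove here.
$anchored = preg_replace(['~^<!\[CDATA\[~', '~\]\]>$~'], '', $xml);

// Unanchored patterns remove the wrapper regardless of position.
$unanchored = preg_replace(['~<!\[CDATA\[~', '~\]\]>~'], '', $xml);

var_export([$anchored === $xml, trim($unanchored)]);
```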

If the specific <p> tagged substring to be removed sits between two substrings to be kept, you may like to add another pattern that condenses runs of whitespace (\s{2,}) into a single space, or you might append \s* to the end of my first pattern to consume trailing whitespace. Only you will know which suits your data.
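Here is a sketch of both options on a made-up input where the unwanted paragraph sits between two paragraphs to keep. The $core, $condensed, and $tight names are only for illustration.

```php
<?php
// The first pattern from the answer, without delimiters and flags, so it can be reused.
$core = '<p>The post <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a>'
    . ' appeared first on <a rel="nofollow" href="https?://[a-z]+\.com[^"]*">.*?</a>\.</p>';

// Made-up input: the unwanted paragraph is nested between two wanted ones.
$html = '<p>First part.</p>' . "\n    "
    . '<p>The post <a rel="nofollow" href="https://a.com/x/">X</a>'
    . ' appeared first on <a rel="nofollow" href="https://a.com">a.com</a>.</p>' . "\n    "
    . '<p>Second part.</p>';

// Option 1: remove the paragraph, strip tags, then condense whitespace runs to one space.
$condensed = trim(preg_replace('~\s{2,}~', ' ', strip_tags(preg_replace('~' . $core . '~iu', '', $html))));

// Option 2: let the first pattern also consume the whitespace that follows the paragraph.
$tight = trim(strip_tags(preg_replace('~' . $core . '\s*~iu', '', $html)));

var_export([$condensed, $tight]);
```

Option 1 flattens everything onto one line, while option 2 only removes the gap left by the deleted paragraph and keeps the original line break between the surviving parts.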
