且构网


Regex to match words in a paragraph while excluding inner HTML tags

Updated: 2023-02-17 22:31:30


I'm using a regex I found on *** to surround each word in a paragraph with a span tag, so that a user can click on any word to see its definition. This works perfectly. However, the issue I've run into is that sometimes the paragraph contains phrases wrapped in an inner tag such as <em>, e.g. a title.

Works:

<div id="passage"> 
<p>
    Hello, my name is SirTophamHatt.
</p>
...
</div>

$('#passage').find('p').each(function() {
    $(this).html(function (index, oldHtml) {
        return oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');
    });
});

<div id="passage">
<p>
    <span class="word">Hello</span>, <span class="word">my</span> <span class="word">name</span> <span class="word">is</span> <span class="word">SirTophamHatt</span>.
</p>
...
</div>

Does not work:

<div id="passage"> 
<p>
    <em>Hello, my name is SirTophamHatt.</em>
</p>
...
</div>

$('#passage').find('p').each(function() {
    $(this).html(function (index, oldHtml) {
        return oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');
    });
});

<div id="passage">
<p>
    <
    <span class="word">em</span>
    >
    <span class="word">Hello</span>, 
    <span class="word">my</span> 
    <span class="word">name</span> 
    <span class="word">is</span> 
    <span class="word">SirTophamHatt</span>
    <!--<span class="word">-->em>
</p>
...
</div>

I separated the last paragraph for clarity.

I'm not great with regex; how can I modify the pattern so that it matches all words that are not part of an opening or closing HTML tag?

Thanks!

EDIT: The words within the child elements must still get the span wrapped around them. The HTML tags themselves must be ignored.

EDIT2: Rushed the example, did not provide proper use of string replace.

You can use a negative lookahead to make sure the word is not followed by a closing angle bracket without an opening angle bracket in between, which would mean the word sits inside a tag:

\b(\w+(?![^<>]*>))\b
      ^^^^^^^^^^^

And I think you can safely remove the ? in \w+?, since you're matching whole words.
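
To see the pattern in isolation, here is a minimal sketch that applies it to a plain string, without jQuery. The helper name wrapWords and the constant WORD_OUTSIDE_TAGS are made up for illustration and do not appear in the original post.

```javascript
// Pattern from the answer above: a whole word that is NOT followed by
// "...>" without an intervening "<", i.e. a word that is not inside a tag.
const WORD_OUTSIDE_TAGS = /\b(\w+(?![^<>]*>))\b/g;

// Hypothetical helper: wraps every word outside of tags in a span.
function wrapWords(html) {
  return html.replace(WORD_OUTSIDE_TAGS, '<span class="word">$1</span>');
}

// The <em> and </em> tags are left alone; only the text is wrapped.
console.log(wrapWords('<em>Hello, my name</em>'));

// Attribute names and values sit inside a tag, so they are skipped too.
console.log(wrapWords('<a href="x">Hi</a>'));
```

In the jQuery callback from the question, the body would then presumably become return wrapWords(oldHtml);. Note that this sketch does not handle HTML entities: in a string such as &amp; the entity name would also be treated as a word and wrapped.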