且构网

Remove all HTML tags + content from text

Updated: 2023-01-25 18:39:19


OK, so as straightforward as it may seem, I'm still not able to do it properly. I've tried regex, and I've even made an attempt at DOM parsing, but I still can't get it right.

Based on an answer to a previous question of mine (Trying to remove HTML tags (+ content) from String), this is what I've ended up with:

    public static function removeHtmlTags($str) {
        $dom = new DOMDocument();
        $errorState = libxml_use_internal_errors(true);
        $dom->loadHTML($str);

        $xpath = new DOMXPath($dom);
        $node = $xpath->query('//body/p/text()')->item(0);

        if (isset($node->textContent)) $ret = $node->textContent;
        else $ret="";

        libxml_use_internal_errors($errorState);

        return $ret;
    }

It seems to do the trick most of the time; however, here's the catch...

This (in case you don't recognize it, it's a Wikipedia infobox):

|conventional_long_name = Italian Republic
|native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}
|common_name = Italy
|nickname(s) = Il Belpaese
|image_flag = Flag of Italy.svg
|image_coat = Italy-Emblem.svg
|symbol_type = Emblem
|image_map = EU-Italy.svg
|map_caption = {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}}
|national_anthem = {{native name|it|[[Il Canto degli Italiani]]}}<br/>{{small|''The Song of the Italians''}} [[File:Inno di Mameli instrumental.ogg|center]]
|official_languages = [[Italian language|Italian]]<sup>a</sup>
|Religion= [[Roman Catholic]]
|capital = {{Coat of arms|Rome}}
|latd=41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E
|largest_city = capital
|largest_metropolitan area = {{hlist |[[Milan]] |[[Naples]]}}
|demonym = [[Italians|Italian]]
|government_type = [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]]
|leader_title1 = [[President of Italy|President]]
|leader_name1 = [[Giorgio Napolitano]]
|leader_title2 = [[Prime Minister of Italy|Prime Minister]]
|leader_name2 = [[Enrico Letta]]
|leader_title3 = [[List of Presidents of the Senate of Italy|President of the Senate]]
|leader_name3 = [[Pietro Grasso]]
|leader_title4 = [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]]
|leader_name4 = [[Laura Boldrini]]
|legislature = [[Parliament of Italy|Parliament]]
|upper_house = [[Italian Senate|Senate of the Republic]]
|lower_house = [[Italian Chamber of Deputies|Chamber of Deputies]]
|accessionEUdate = 25 March 1957 (founding member)
|EUseats = 78
|area_rank = 72nd
|area_magnitude = 1 E11
|area_km2 = 301,338
|area_sq_mi = 116,347 <!--Do not remove per [[WP:MOSNUM]]-->
|percent_water = 2.4
|population_census = 59,433,744<ref name="Istat">{{cite web |url=http://www.istat.it/it/files/2012/12/volume_popolazione-legale_XV_censimento_popolazione.pdf|title=Census 2011 - final results |publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=19 December 2012}}</ref>
|population_census_year = 2011
|population_census_rank = 23rd
|population_estimate = 59,685,227<ref>{{cite web |url=http://www.istat.it/en/archive/94537|title=Resident population and population change|publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=25 June 2013}}</ref>
|population_estimate_year = 2012
|population_estimate_rank = 23rd
|population_density_rank = 63rd
|population_density_km2 = 197.7
|population_density_sq_mi = 511.6 <!--Do not remove per [[WP:MOSNUM]]-->
|GDP_PPP = $1.848 trillion<ref name=autogenerated1 >{{cite web |url=http://www.imf.org/external/pubs/ft/weo/2013/02/weodata/weorept.aspx?pr.x=25&pr.y=1&sy=2013&ey=2013&scsm=1&ssd=1&sort=country&ds=.&br=1&c=136&s=NGDPD%2CNGDPDPC%2CPPPGDP%2CPPPPC&grp=0&a= |title=Italy |publisher=International Monetary Fund |accessdate=17 October 2013}}</ref>
|GDP_PPP_rank = 11th
|GDP_PPP_year = 2014
|GDP_PPP_per_capita = $30,218<ref name=autogenerated1/>
|GDP_PPP_per_capita_rank = 34th
|GDP_nominal = $2.148 trillion<ref name=autogenerated1/>
|GDP_nominal_rank = 9th
|GDP_nominal_year = 2014
|GDP_nominal_per_capita = $35,123<ref name=autogenerated1/>
|GDP_nominal_per_capita_rank = 27th
|sovereignty_type = [[History of Italy|Formation]]
|established_event1 = [[Italian unification|Unification]]
|established_date1 = 17 March 1861
|established_event2 = [[Italian constitutional referendum, 1946|Republic]]
|established_date2 = 2 June 1946
|Gini_year = 2011
|Gini_change =  <!--increase/decrease/steady-->
|Gini = 31.9 <!--number only-->
|Gini_ref = <ref name=eurogini>{{cite web|title=Gini coefficient of equivalised disposable income (source: SILC)|url=http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ilc_di12|publisher=Eurostat Data Explorer|accessdate=13 August 2013}}</ref>
|Gini_rank =
|HDI_year = 2013
|HDI_change = increase <!--increase/decrease/steady-->
|HDI = 0.881 <!--number only-->
|HDI_ref = <ref name="HDI">{{cite web |url=http://hdr.undp.org/en/media/HDR_2011_EN_Table1.pdf |title=Human Development Report 2011 |year=2011 |publisher=United Nations |accessdate=5 November 2011}}</ref>
|HDI_rank = 25th
|currency = Euro ([[Euro sign|€]])<sup>b</sup>
|currency_code = EUR
|country_code =
|time_zone = [[Central European Time|CET]]
|utc_offset = +1
|time_zone_DST = [[Central European Summer Time|CEST]]
|utc_offset_DST = +2
|drives_on = right
|calling_code = [[Telephone numbers in Italy|39]]<sup>c</sup>
|cctld = [[.it]]<sup>d</sup>
|footnote_a = <span style="font-size:100%;">French is co-official in the [[Aosta Valley]]; [[Slovene language|Slovene]] is co-official in the [[province of Trieste]] and the [[province of Gorizia]]; German and [[Ladin language|Ladin]] are co-official in [[South Tyrol]].</span>

|footnote_b = <span style="font-size:100%;">Before 2002, the [[Italian lira|Italian Lira]]. The euro is accepted in [[Campione d'Italia]], but the official currency there is the [[Swiss Franc]].<ref>{{cite web |url=http://www.comune.campione-d-italia.co.it/ |title=Comune di Campione d'Italia |publisher=Comune.campione-d-italia.co.it |date=14 July 2010 |accessdate=30 October 2010}}</ref></span>
|footnote_c = <span style="font-size:100%;">To call [[Campione d'Italia]], it is necessary to use the Swiss code [[+41]].</span>
|footnote_d = <span style="font-size:100%;">The [[.eu]] domain is also used, as it is shared with other [[European Union]] member states.</span>

becomes (after also exploding on the newlines):

Array
(
    [conventional_long_name] => Italian Republic
    [native_name] => {{lang|it|''Repubblica italiana
    [common_name] => Italy
    [nickname(s)] => Il Belpaese
    [image_flag] => Flag of Italy.svg
    [image_coat] => Italy-Emblem.svg
    [symbol_type] => Emblem
    [image_map] => EU-Italy.svg
    [map_caption] => {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}}
    [national_anthem] => {{native name|it|[[Il Canto degli Italiani]]}}
    [official_languages] => [[Italian language|Italian]]
    [Religion] => [[Roman Catholic]]
    [capital] => {{Coat of arms|Rome}}
    [latd] => 41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E
    [largest_city] => capital
    [largest_metropolitan area] => {{hlist |[[Milan]] |[[Naples]]}}
    [demonym] => [[Italians|Italian]]
    [government_type] => [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]]
    [leader_title1] => [[President of Italy|President]]
    [leader_name1] => [[Giorgio Napolitano]]
    [leader_title2] => [[Prime Minister of Italy|Prime Minister]]
    [leader_name2] => [[Enrico Letta]]
    [leader_title3] => [[List of Presidents of the Senate of Italy|President of the Senate]]
    [leader_name3] => [[Pietro Grasso]]
    [leader_title4] => [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]]
    [leader_name4] => [[Laura Boldrini]]
    [legislature] => [[Parliament of Italy|Parliament]]
    [upper_house] => [[Italian Senate|Senate of the Republic]]
    [lower_house] => [[Italian Chamber of Deputies|Chamber of Deputies]]
    [accessionEUdate] => 25 March 1957 (founding member)
    [EUseats] => 78
    [area_rank] => 72nd
    [area_magnitude] => 1 E11
    [area_km2] => 301,338
    [area_sq_mi] => 116,347 
    [percent_water] => 2.4
    [population_census] => 59,433,744
    [population_census_year] => 2011
    [population_census_rank] => 23rd
    [population_estimate] => 59,685,227
    [population_estimate_year] => 2012
    [population_estimate_rank] => 23rd
    [population_density_rank] => 63rd
    [population_density_km2] => 197.7
    [population_density_sq_mi] => 511.6 
    [GDP_PPP] => $1.848 trillion
    [GDP_PPP_rank] => 11th
    [GDP_PPP_year] => 2014
    [GDP_PPP_per_capita] => $30,218
    [GDP_PPP_per_capita_rank] => 34th
    [GDP_nominal] => $2.148 trillion
    [GDP_nominal_rank] => 9th
    [GDP_nominal_year] => 2014
    [GDP_nominal_per_capita] => $35,123
    [GDP_nominal_per_capita_rank] => 27th
    [sovereignty_type] => [[History of Italy|Formation]]
    [established_event1] => [[Italian unification|Unification]]
    [established_date1] => 17 March 1861
    [established_event2] => [[Italian constitutional referendum, 1946|Republic]]
    [established_date2] => 2 June 1946
    [Gini_year] => 2011
    [Gini_change] => 
    [Gini] => 31.9 
    [Gini_ref] => 
    [HDI_year] => 2013
    [HDI_change] => increase 
    [HDI] => 0.881 
    [HDI_ref] => 
    [HDI_rank] => 25th
    [currency] => Euro ([[Euro sign|â¬]])
    [currency_code] => EUR
    [time_zone] => [[Central European Time|CET]]
    [utc_offset] => +1
    [time_zone_DST] => [[Central European Summer Time|CEST]]
    [utc_offset_DST] => +2
    [drives_on] => right
    [calling_code] => [[Telephone numbers in Italy|39]]
    [cctld] => [[.it]]
    [footnote_a] => 
    [footnote_b] => 
    [footnote_c] => 
    [footnote_d] => 
)

And I'm wondering:

What happened to |native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}

Couldn't that become:

|native_name = {{lang|it|''Repubblica italiana''}}

Instead, it seems to be getting rid of both the HTML comment and the text that follows.

Any ideas?

A way from Hell:

$str = substr($str, 1);
$lines = explode("\n|", $str);

$result = array();

$pattern = '~
# subpattern definitions
(?(DEFINE)
    (?<c> <!--.*?--> )      # html comment
    (?<tag>                 # tag (possible nested tags with the same name)
        (   <(\w++)
            (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) )*
            </\g{-1}> ) 
    )
    (?<sctag> <\w++[^>]*> ) # self closing tag 
)
# main pattern
\g<c> | \g<tag> | \g<sctag> | \s+$
~x';

foreach($lines as $line) {
    $kv = explode(' = ', $line, 2);

    $kv[1] = (isset($kv[1])) ? preg_replace($pattern, '', $kv[1]) : null;

    $result[$kv[0]] = $kv[1];
}
unset($kv, $pattern, $lines, $str);
echo '<pre>' . htmlspecialchars(print_r($result, true)) . '</pre>';
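
To see the pattern at work on a few lines from the question, here is a minimal sketch. The helper name stripMarkup() and the three-line sample are assumptions made for illustration, and the self-closing subpattern is written as `<\w++[^>]*>`, which is my reading of the intended form. Each field shows one branch of the main pattern: an HTML comment, a paired tag with its content, and a self-closing tag.

```php
<?php
// Hypothetical helper (not part of the original answer): wraps the pattern
// so it can be applied to a single infobox value.
function stripMarkup(string $value): string
{
    $pattern = '~
    (?(DEFINE)
        (?<c> <!--.*?--> )          # html comment
        (?<tag>                     # paired tag, nested same-name tags allowed
            (   <(\w++)
                (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) )*
                </\g{-1}> )
        )
        (?<sctag> <\w++[^>]*> )     # self closing tag (assumed form)
    )
    \g<c> | \g<tag> | \g<sctag> | \s+$
    ~x';

    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '', $value) ?? $value;
}

// Three lines taken from the infobox in the question.
$str = <<<'EOT'
|native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}
|official_languages = [[Italian language|Italian]]<sup>a</sup>
|GDP_PPP_per_capita = $30,218<ref name=autogenerated1/>
EOT;

$result = [];
foreach (explode("\n|", substr($str, 1)) as $line) {
    $kv = explode(' = ', $line, 2);
    $result[$kv[0]] = isset($kv[1]) ? stripMarkup($kv[1]) : null;
}

print_r($result);
// native_name        => {{lang|it|''Repubblica italiana''}}
// official_languages => [[Italian language|Italian]]
// GDP_PPP_per_capita => $30,218
```

Note that the split is on ' = ' with a space on both sides, so a line such as |Religion= [[Roman Catholic]] ends up with a null value under this scheme.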

Note 1: Since the string contains uncommon tags (i.e. tags that are not HTML tags), the same tag may appear both as a paired tag and as a self-closing tag. In other words, you can find <ref>....</ref> and <ref/> (or <ref> used as a self-closing tag) in the same document. To deal with this specific case, you can change the middle line of the tag subpattern definition to: (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) | <\g{-1}\b[^>]*?/?> )*

Note 2: If you don't want to use regex, the alternative is the DOM, but since the <ref> tag doesn't exist in HTML, you must write your own DTD that describes this tag (and all the other HTML tags), add it to your string, and use the loadXML method of the DOMDocument class.
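
As for why the DOM attempt in the question loses the text after the comment: a comment node splits the surrounding text into two separate text nodes, and `->item(0)` keeps only the first of them. The sketch below is a hypothetical variant of removeHtmlTags() (the name textOutsideTags() and the XPath union are assumptions) that concatenates every top-level text node instead. It still relies on loadHTML tolerating unknown tags such as <ref>, so treat it as an illustration rather than a replacement for the DTD route described in note 2.

```php
<?php
// Hypothetical variant of removeHtmlTags(): joins all text nodes found
// directly under <body> or under a top-level <p>, rather than the first one.
function textOutsideTags(string $str): string
{
    if ($str === '') {
        return '';          // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    $errorState = libxml_use_internal_errors(true);
    $dom->loadHTML($str);   // unknown tags are reported as errors but still parsed
    libxml_clear_errors();
    libxml_use_internal_errors($errorState);

    $xpath = new DOMXPath($dom);
    $ret = '';
    // The union covers both the case where the parser wraps leading text
    // in an implied <p> and the case where it leaves it directly in <body>.
    foreach ($xpath->query('//body/text() | //body/p/text()') as $node) {
        $ret .= $node->textContent;
    }

    return $ret;
}

echo textOutsideTags("{{lang|it|''Repubblica italiana<!--italiana is without uppercase-->''}}"), "\n";
```

Text that sits inside an element (for example the content of <ref>...</ref> or <sup>...</sup>) is not a direct child of <body> or <p>, so it is dropped along with its tag, which is the behaviour the question asks for.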