How do I parse only part of an HTML file and ignore the rest?

Updated: 2023-09-26 12:11:22

Do you mean the 999th line of the file or the 999th table row?

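The former could be done with a one-liner. Closing ARGV at the end of each file resets the line counter $., so the test applies to every file rather than only to the first:

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat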

The latter would involve an HTML parser and some selection logic. A SAX-style (event-driven) parser might be better for processing a large number of files quickly. It probably depends on which version of HTML is used and whether it is well-formed.
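
As a rough illustration of the event-driven approach, here is a minimal sketch using HTML::Parser (a CPAN module, not part of core Perl). It collects the text of the first row of the first table that has a bgcolor attribute and then aborts parsing, so the rest of the document is never processed. The function name first_row_text is made up for this example, and the sketch assumes the tables are not nested.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text of the first <tr> inside the first
# <table> carrying a bgcolor attribute, or undef if there is no such row.
# Assumes tables are not nested.
sub first_row_text {
    my ($html) = @_;
    my ($in_table, $in_row, $done) = (0, 0, 0);
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($self, $tag, $attr) = @_;
            if    ($tag eq 'table' && exists $attr->{bgcolor}) { $in_table = 1 }
            elsif ($tag eq 'tr' && $in_table)                  { $in_row   = 1 }
        }, 'self, tagname, attr' ],
        # Collect decoded text only while inside the wanted row.
        text_h => [ sub { $text .= $_[0] if $in_row }, 'dtext' ],
        end_h => [ sub {
            my ($self, $tag) = @_;
            if ($tag eq 'tr' && $in_row) {
                $done = 1;
                $self->eof;    # calling eof inside a handler aborts parsing
            }
        }, 'self, tagname' ],
    );

    $parser->parse($html);
    $parser->eof unless $done;    # flush normally if we never aborted
    return $done ? $text : undef;
}

# Usage: perl first_row.pl file1.html file2.html ...
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "$file: $!\n"; next };
    local $/;                     # slurp the whole file
    my $row = first_row_text(<$fh>);
    print "$file: ", (defined $row ? $row : '(no matching row)'), "\n";
}
```

Whether this is actually faster than building a full tree depends on how early the wanted row appears in each file, so it is worth timing both approaches on a sample.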

Perl has many XML and HTML parsers. Did you have a particular module in mind?

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
my $html = get $url;
die "Could not fetch $url\n" unless defined $html;   # get returns undef on failure

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
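$tree->parse($html);
$tree->eof;                     # signal end of input so the tree is complete
# text of the first row of each table that has a bgcolor attribute
print $tree->findvalue('//table[@bgcolor]/tr[1]'), "\n";
$tree->delete;                  # free the tree before parsing the next file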

Try pasting the above into a file and running it with Perl.
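
The comments in the script mention replacing the download with a loop over the existing files. One way that loop might look is sketched below, assuming the files are passed on the command line. The helper name first_row_from_file is invented for this example; the XPath expression is the one used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: parse one local file and return whatever the XPath
# expression matches, or undef if the file cannot be opened.
sub first_row_from_file {
    my ($file) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    unless ($tree->parse_file($file)) {    # undef if the file can't be opened
        $tree->delete;
        return undef;
    }
    my $value = $tree->findvalue('//table[@bgcolor]/tr[1]');
    $tree->delete;                         # release the tree for this file
    return "$value";                       # force a plain string
}

# Usage: perl rows.pl /path/to/*.html
for my $file (@ARGV) {
    my $row = first_row_from_file($file);
    if (defined $row) {
        print "$file: $row\n";
    }
    else {
        warn "$file: could not be parsed: $!\n";
    }
}
```

Creating a fresh tree per file and deleting it afterwards keeps memory use flat across thousands of files.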