且构网

Getting the character offset of an element in jsoup

Updated: 2022-12-29 19:41:12

I don't believe Jsoup has this functionality. This question seems closer to lexical analysis than HTML parsing.

I would write a grammar and then a lexer for that grammar; the lexer tokenizes the HTML and supplies the offsets you're looking for.

First, parse the document with Jsoup to verify that it is valid HTML.

Then run the lexer over the document. The grammar might look like this:

Document := {optional-opening-tag} | {literal} {optional-opening-tag} | {optional-closing-tag}

optional-opening-tag := ["<" {literal} ">" {optional-opening-tag}|{literal} ] | ""

optional-closing-tag := "</" {literal} ">" | ""

literal := any string of characters that does not begin with whitespace and does not contain "<"

Store each token you find in an object that records the token text, the index of its first character, and its length.
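As a rough illustration of that last step, here is a minimal sketch of such a lexer in plain Java. The names `OffsetLexer`, `Token`, `Kind`, and `tokenize` are made up for this example and are not part of Jsoup. It assumes a very simplified view of HTML: a tag is everything from "<" to the next ">", and text is everything else, with leading whitespace skipped. It does not handle comments, scripts, or ">" inside attribute values. Offsets are indices into the Java String, so they count UTF-16 code units.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits HTML into tags and text and records offsets. */
final class OffsetLexer {

    enum Kind { OPEN_TAG, CLOSE_TAG, TEXT }

    /** One token plus its position in the original input string. */
    static final class Token {
        final Kind kind;
        final String text;
        final int start;   // index of the token's first character in the input
        final int length;  // number of characters in the token

        Token(Kind kind, String text, int start) {
            this.kind = kind;
            this.text = text;
            this.start = start;
            this.length = text.length();
        }

        @Override
        public String toString() {
            return kind + " \"" + text + "\" start=" + start + " length=" + length;
        }
    }

    /** Walks the input once, emitting a token for each tag or run of text. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;  // a literal may not begin with whitespace, so skip it
                continue;
            }
            int end;
            Kind kind;
            if (c == '<') {
                // Assumption: the tag ends at the next '>' (or at end of input).
                int close = html.indexOf('>', i);
                end = (close < 0) ? html.length() : close + 1;
                kind = html.startsWith("</", i) ? Kind.CLOSE_TAG : Kind.OPEN_TAG;
            } else {
                // Text runs until the next '<' (or end of input).
                int next = html.indexOf('<', i);
                end = (next < 0) ? html.length() : next;
                kind = Kind.TEXT;
            }
            tokens.add(new Token(kind, html.substring(i, end), i));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<p>Hello <b>world</b></p>")) {
            System.out.println(t);
        }
    }
}
```

To find the offset of a particular element, you would scan the returned list for the matching opening tag and read its `start` field. Mapping a token back to a specific Jsoup `Element` (for example, the third `<b>` in the document) is left out here and would need extra bookkeeping.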