且构网

How to parse XML in Bash?

Updated: 2023-02-18 14:31:30

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt

This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

Okay, so it defines a function called read_dom. The first line makes IFS (the internal field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on space, tab or newline, it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

<tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and '' (or just the file's trailing newline, if it has one). Read then assigns the variables like: ENTITY=/tag and CONTENT=. Note that this third call reaches the end of the file before it sees another '<', so although read still assigns the variables, it returns a non-zero status. Any call after that also returns non-zero, with nothing left to read.
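If you want to watch those calls happen, here is a minimal sketch that traces them. trace_dom is a hypothetical helper I made up for illustration (it is not part of Yuzem's answer); it prints the exit status of read_dom along with both variables after each call. Bash's read returns non-zero as soon as it hits end of file without finding the delimiter, even though it still fills in the variables, so the status flips on the call that reads '/tag>'.

```shell
#!/usr/bin/env bash
# Sketch: trace each read_dom call on the one-line document <tag>value</tag>.

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

# trace_dom is a hypothetical helper, not part of the original answer.
# It prints the exit status, ENTITY and CONTENT after every call and
# stops after the first call that returns non-zero.
trace_dom () {
    local status
    while :; do
        read_dom
        status=$?
        printf 'status=%s entity=[%s] content=[%s]\n' "$status" "$ENTITY" "$CONTENT"
        [ "$status" -eq 0 ] || break
    done
}

# printf '%s' sends the markup without a trailing newline.
printf '%s' '<tag>value</tag>' | trace_dom
# status=0 entity=[] content=[]
# status=0 entity=[tag] content=[value]
# status=1 entity=[/tag] content=[]
```

The practical consequence is that a plain "while read_dom" loop never runs its body for the very last closing tag in the input.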

Now his while loop, cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity, then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
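For trying this out without any files on disk, here is a self-contained sketch of the same loop. get_title is a hypothetical wrapper name, and the XHTML in the here-document is made up for illustration. It uses return rather than exit, on the assumption that you might want to call it from a larger script without ending that script.

```shell
#!/usr/bin/env bash
# Sketch: pull the <title> text out of a small inline XHTML sample.

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

# get_title is a hypothetical wrapper; it reads markup on stdin and
# prints the content of the first <title> element it sees.
get_title () {
    while read_dom; do
        if [[ $ENTITY = "title" ]]; then
            printf '%s\n' "$CONTENT"
            return 0
        fi
    done
    return 1   # no <title> element was seen
}

# The sample markup below is invented for this example.
get_title <<'EOF'
<html>
  <head>
    <title>My Page</title>
  </head>
  <body><p>Hello</p></body>
</html>
EOF
# My Page
```

Note the match is on the whole entity, so a title tag that carries attributes would not match "title" here.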

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

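You should get the following (leaving out the whitespace-only lines that come from the newlines and indentation between tags; the closing /ListBucketResult never shows up because the read that reaches it hits end of file and returns non-zero):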

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => item-apple-iso@2x.png
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 
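Since IFS is '>' inside read_dom, spaces and newlines are no longer trimmed, so CONTENT for a container element like Contents is really the newline and indentation that follow the tag. If you only care about elements that carry text, one possible filter looks like the sketch below. leaf_elements is a hypothetical name, and the sample input is a cut-down version of the listing above.

```shell
#!/usr/bin/env bash
# Sketch: print only the entities that carry non-blank content.

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

# leaf_elements is a hypothetical filter, not part of the original answer.
leaf_elements () {
    while read_dom; do
        # Closing tags such as /Name carry no content of their own.
        [[ $ENTITY == /* ]] && continue
        # Skip entities whose content is empty or only spaces and newlines.
        [[ $CONTENT =~ ^[[:space:]]*$ ]] && continue
        printf '%s => %s\n' "$ENTITY" "$CONTENT"
    done
}

leaf_elements <<'EOF'
<Contents>
  <Key>item-apple-iso@2x.png</Key>
  <Size>1785</Size>
</Contents>
EOF
# Key => item-apple-iso@2x.png
# Size => 1785
```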

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo "$CONTENT"
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.

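EDIT If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function, and return the exit status of read so the while loop still knows when to stop:

read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    RET=$?                # remember whether read hit end of file
    IFS=$ORIGINAL_IFS
    return $RET
}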

Otherwise, any line splitting you do later in the script will be messed up.
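Another option, if local is the problem, is to put the assignment on the same line as read. In bash an assignment written in front of a builtin like read only applies to that one command, so there is nothing to restore afterwards. This is a sketch of that variant, not something from Yuzem's answer, and demo is a hypothetical function that just checks IFS is unchanged after the loop.

```shell
#!/usr/bin/env bash
# Sketch: scope IFS to the read command itself instead of using local.

read_dom () {
    # The IFS assignment applies only to this one read command, and
    # read's exit status becomes the function's exit status.
    IFS=\> read -d \< ENTITY CONTENT
}

# demo is a hypothetical helper used only to show IFS is left alone.
demo () {
    local before=$IFS
    while read_dom; do
        printf '%s => %s\n' "$ENTITY" "$CONTENT"
    done
    [[ $IFS == "$before" ]] && echo "IFS unchanged"
}

demo <<< '<tag>value</tag>'
#  => 
# tag => value
# IFS unchanged
```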

EDIT 2 To split out attribute name/value pairs, you can augment read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}
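To see what those two parameter expansions do on their own, here is a small sketch. split_entity is a hypothetical helper that applies the same expansions to a string you pass in. Note the second case: when the entity has no space (no attributes), both expansions leave the string untouched, so ATTRIBUTES ends up equal to the tag name.

```shell
#!/usr/bin/env bash
# Sketch: the two expansions used for TAG_NAME and ATTRIBUTES, in isolation.

# split_entity is a hypothetical helper, not part of the original answer.
split_entity () {
    local entity=$1
    local tag_name=${entity%% *}     # drop the longest suffix that starts at a space
    local attributes=${entity#* }    # drop the shortest prefix that ends at a space
    printf 'tag=[%s] attrs=[%s]\n' "$tag_name" "$attributes"
}

split_entity 'foo size="1789" type="unknown"'
# tag=[foo] attrs=[size="1789" type="unknown"]

split_entity 'Name'
# tag=[Name] attrs=[Name]
```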

Then write your function to parse and get the data you want like this:

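parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        # eval runs ATTRIBUTES as shell code, turning size="..." and
        # type="..." into local variables; only do this with XML you trust
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}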

Then, while you read_dom, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789
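Because eval executes whatever is in the attribute string, you may prefer to pull out a single attribute with a regular expression when the XML comes from somewhere you don't control. The sketch below is one way to do that. get_attr is a hypothetical helper, and it assumes attribute values are wrapped in double quotes and that the attribute name is a plain word with no regex metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: read one attribute value without eval.

# get_attr NAME ATTRIBUTES
# Hypothetical helper: prints the value of NAME from a string such as
#   size="1789" type="unknown"
# and returns 1 if the attribute is not present.
get_attr () {
    local name=$1 attributes=$2
    # NAME must sit at the start of the string or after whitespace.
    local re="(^|[[:space:]])${name}=\"([^\"]*)\""
    if [[ $attributes =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

get_attr size 'size="1789" type="unknown"'
# 1789
get_attr type 'size="bar_size" type="metal"'
# metal
```

Inside parse_dom you could then write something like size=$(get_attr size "$ATTRIBUTES") in place of the eval line.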

EDIT 3 Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.