Updated: 2023-11-24 11:02:52
I revamped this question since I've been reading a bit on XML.
I have a source file that contains a list of AuthNumbers.
111222
111333
111444
etc.
I need to search for the numbers in that list and find them in a corresponding XML file.
In the xml file the line is formatted as such:
<trpcAuthCode>111222</trpcAuthCode>
This can be achieved quite painlessly using grep; however, I require the entire block containing the transaction.
The block starts with:
<trans type="network sale" recalled="false">
or <trans type="network sale" recalled="false" rollback="true">
and/or some other variations. Actually <trans*>
would be best if something like that is possible.
The block ends with </trans>
It doesn't need to be elegant or efficient. I just need it to work. I suspect some transactions are dropping out and I need a quick way to vet the ones that are not being processed.
If it helps, here is a link to the original (sanitized) xml https://www.dropbox.com/s/cftn23tnz8uc9t8/main.xml?dl=0
And what I would like to extract: https://www.dropbox.com/s/b2bl053nom4brkk/transaction_results.xml?dl=0
The size of each result will vary as each transaction can vary greatly in length depending on the amount of products purchased. In the results xml you see that I extracted the xml I need based on the trpcAuthCode list 111222,111333,111444.
Concerning XML and awk questions, you often find comments from the gurus (the ones with a k in their reputation) that XML processing in awk is complicated or not sufficient. As I understood the question, the script is needed for personal and/or debugging purposes. For this, my solution should be sufficient but, please, keep in mind that it will not work on every legal XML file.
Based on your description, the sketch of the script is:
If <trans*> is matched, start recording.
If <trpcAuthCode> is found, get its contents and compare them with the list. In case of a match, remember the block for output.
If </trans> is matched, stop recording. If output has been enabled, print the recorded block; otherwise discard it.
Because I did something similar in SO: Shell scripting - split xml into multiple files, this should not be too hard to implement.
Though, one additional feature is necessary: feeding the AuthNumbers array into the script. Due to a surprising coincidence, I learnt the answer just this morning in SO: How to access an array in an awk, which is declared in a different awk in shell? (thanks to the comment of jas).
So, putting it all together in a script filter-xml-trpcAuthCode.awk:
BEGIN {
  record = 0   # state for recording
  buffer = ""  # buffer for recording
  found = 0    # state for found auth code
  # build temp. array from authCodes which has to be pre-defined (awk -v authCodes=...)
  split(authCodes, list, "\n")
  # build final array where values become keys
  for (i in list) authCodeList[list[i]]
  # for debugging: output of authCodeList
  print "<!-- authCodeList:"
  for (authCode in authCodeList) {
    print authCode
  }
  print "-->"
}

/<trans( [^>]*)?>/ {
  record = 1   # start recording
  buffer = ""  # clear buffer
  found = 0    # reset state for found auth code
}

record {
  buffer = buffer"\n"$0  # record line (if recording is enabled)
}

record && /<trpcAuthCode>/ {
  # extract auth code (gensub() is a GNU awk extension)
  authCode = gensub(/^.*>([^<]*)<\/trpcAuthCode.*$/, "\\1", "g")
  # check whether auth code is in authCodeList; only ever set the flag,
  # so a later non-matching code in the same block cannot clear an earlier match
  if (authCode in authCodeList) found = 1
}

/<\/trans>/ {
  record = 0  # stop recording
  # print buffer if auth code has been found
  if (found) {
    print buffer
  }
}
Notes:
I struggled initially when applying split() to authCodes in BEGIN. It makes an array where the split values are stored with enumerated keys. Thus, I looked for a way to make the values themselves the keys of the array. (Otherwise, the in operator cannot be used for the search, because it tests keys, not values.) I found an elegant solution in the accepted answer of SO: Check if array contains value.
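To make the difference between the two arrays concrete, here is a minimal sketch. The function name check and the array name codeSet are made up for this illustration, and the list is hard-coded instead of being passed in via authCodes:

```shell
#!/bin/sh
# Hypothetical helper: reports whether $1 is a key of the array filled by
# split() ("list") and of the array whose keys are the values ("codeSet").
check() {
    awk -v code="$1" 'BEGIN {
        # split() stores the values under the keys 1, 2, 3
        n = split("111222\n111333\n111444", list, "\n")
        # merely referencing an element creates it, so the values become keys
        for (i = 1; i <= n; i++) codeSet[list[i]]
        printf "%s in list: %s, in codeSet: %s\n", code,
            ((code in list) ? "yes" : "no"),
            ((code in codeSet) ? "yes" : "no")
    }'
}

check 111222   # prints: 111222 in list: no, in codeSet: yes
check 2        # prints: 2 in list: yes, in codeSet: no
```

The second call shows why the in operator is useless on the array that split() produces: it finds the index 2, not an auth code.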
I implemented the proposed pattern <trans*> as /<trans( [^>]*)?>/ which will even match <trans> (although <trans> never seems to occur without attributes) but not <transSet>.
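The behaviour of this pattern can be checked in isolation. The following is only a small sketch; match_trans is a hypothetical helper name, and the three test lines are modelled on the tags in the sample file:

```shell
#!/bin/sh
# Hypothetical helper: prints "match" if the line in $1 contains an opening
# <trans> tag (with or without attributes), otherwise "no match".
match_trans() {
    printf '%s\n' "$1" |
        awk '{ print ($0 ~ /<trans( [^>]*)?>/ ? "match" : "no match") }'
}

match_trans '<trans>'                                        # prints: match
match_trans '<trans type="network sale" recalled="false">'   # prints: match
match_trans '<transSet periodID="1">'                        # prints: no match
```

After <trans the pattern accepts only a space (followed by attributes) or the closing >, which is what keeps <transSet> out.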
The statement buffer = buffer"\n"$0 appends the current line to the previous contents. $0 contains the line without the newline character, so it has to be re-inserted. The way I did it, the buffer starts with a newline but the last line ends without one. Considering that print buffer adds a newline at the end of the text, this is fine for me. Alternatively, the above snippet could be replaced by buffer = buffer $0 "\n" or even buffer = (buffer != "" ? buffer"\n" : "") $0. (It's a matter of taste.)
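The three ways of building the buffer differ only in where the newlines end up. A small sketch makes this visible; the helper collect and its variant names lead, trail and join are invented for this illustration, and the buffer is printed between brackets with printf so that no extra newline is added:

```shell
#!/bin/sh
# Hypothetical helper: collects the two input lines "a" and "b" into a buffer
# using the variant named in $1 and prints the buffer between brackets.
collect() {
    printf 'a\nb\n' | awk -v variant="$1" '
        variant == "lead"  { buffer = buffer "\n" $0 }   # newline before each line
        variant == "trail" { buffer = buffer $0 "\n" }   # newline after each line
        variant == "join"  { buffer = (buffer != "" ? buffer "\n" : "") $0 }  # newline only between lines
        END { printf "[%s]", buffer }
    '
}

collect lead;  echo
collect trail; echo
collect join;  echo
```

With print buffer instead of printf, the lead variant yields an empty line before each block and the trail variant an empty line after it, while the join variant yields neither.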
The filtered file is simply printed to standard output, which may be redirected to a file. Considering this, I formatted the additional debug output as an XML comment.
If you are a little familiar with awk, you may notice that there isn't any next statement in my script. This is intentional: the order of the rules is chosen so that a line can be processed by all rules consecutively. (I tested an extreme case, <trans><trpcAuthCode>111222</trpcAuthCode></trans> on a single line, and even this is processed correctly.)
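The single-line case can be reproduced with a reduced version of the rules. This is a sketch under simplifying assumptions, not the script itself: the helper name filter_one_line is made up, the only accepted auth code (111222) is hard-coded, and the code is extracted with two sub() calls instead of gensub() so that it also runs in awk implementations other than GNU awk:

```shell
#!/bin/sh
# Hypothetical helper: feeds the single line $1 through four rules in the same
# order as in the script; no rule uses next, so all of them see the line.
filter_one_line() {
    printf '%s\n' "$1" | awk '
        /<trans( [^>]*)?>/ { record = 1; buffer = ""; found = 0 }  # 1. start recording
        record { buffer = buffer $0 }                               # 2. record the line
        record && /<trpcAuthCode>/ {                                # 3. check the auth code
            code = $0
            sub(/^.*<trpcAuthCode>/, "", code)
            sub(/<\/trpcAuthCode>.*$/, "", code)
            if (code == "111222") found = 1
        }
        /<\/trans>/ { record = 0; if (found) print buffer }         # 4. stop and print
    '
}

# the matching block is printed unchanged
filter_one_line '<trans><trpcAuthCode>111222</trpcAuthCode></trans>'
# a block with a different code produces no output
filter_one_line '<trans><trpcAuthCode>999999</trpcAuthCode></trans>'
```

If the first rule ended with next, rules 2 to 4 would never see the line and the block would be lost.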
To simplify testing, I added a wrapper bash script filter-xml-trpcAuthCode.sh:
#!/usr/bin/bash
# uncomment next line for debugging
#set -x
# check command line arguments
if [[ $# -ne 2 ]]; then
  echo "ERROR: Illegal number of command line arguments!"
  echo ""
  echo "Usage:"
  echo "$(basename "$0") XML_FILE AUTH_CODES"
  exit 1
fi
# call awk script (the list file is passed as one multi-line string)
awk -v authCodes="$(cat "$2")" -f filter-xml-trpcAuthCode.awk "$1"
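As an aside, the list does not have to travel through a -v variable. A common awk idiom is to pass the list file as the first input file and to collect its lines while NR == FNR holds. The following is only a sketch of that alternative, not what the scripts above do; the function name filter_by_codes and the temporary file names are invented, and it prints just the matching codes instead of whole blocks:

```shell
#!/bin/sh
# Hypothetical alternative: $1 is the file with one auth code per line,
# $2 is the XML file. Prints every auth code of $2 that occurs in $1.
filter_by_codes() {
    awk '
        # NR == FNR is true only while the first file is read
        # (assumes the list file is not empty)
        NR == FNR { if ($0 != "") codes[$0]; next }
        /<trpcAuthCode>/ {
            code = $0
            sub(/^.*<trpcAuthCode>/, "", code)
            sub(/<\/trpcAuthCode>.*$/, "", code)
            if (code in codes) print code
        }
    ' "$1" "$2"
}

# small demonstration with throw-away files
tmp=$(mktemp -d)
printf '111222\n111333\n' > "$tmp/codes.txt"
printf '<trpcAuthCode>532524</trpcAuthCode>\n<trpcAuthCode>111222</trpcAuthCode>\n' > "$tmp/in.xml"
filter_by_codes "$tmp/codes.txt" "$tmp/in.xml"   # prints: 111222
rm -r "$tmp"
```

The next statement is needed here because the lines of the list file must not reach the XML rules.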
I tested the scripts (with bash in cygwin on Windows 10) against your sample file main.xml and got four matching blocks. I was a little concerned about the output because your sample output transaction_results.xml contains only three matching blocks. But checking my output visually, it seems to be appropriate. (All four hits contained a matching <trpcAuthCode> element.)
For demonstration, I reduced your sample input a little to sample.xml:
<?xml version="1.0"?>
<transSet periodID="1" periodname="Shift" longId="2017-04-27" shortId="052" site="12345">
<trans type="periodClose">
<trHeader>
</trHeader>
</trans>
<printCashier>
<cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
</printCashier>
<trans type="printCashier">
<trHeader>
<cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
<posNum>101</posNum>
</trHeader>
</trans>
<trans type="journal">
<trHeader>
</trHeader>
</trans>
<trans type="network sale" recalled="false">
<trHeader>
<termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
</trHeader>
<trPaylines>
<trPayline type="sale" sysid="1" locale="DOLLAR">
<trpCardInfo>
<trpcAccount>1234567890123456</trpcAccount>
<trpcAuthCode>532524</trpcAuthCode>
</trpCardInfo>
</trPayline>
</trPaylines>
</trans>
<trans type="network sale" recalled="false">
<trHeader>
<termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
</trHeader>
<trPaylines>
<trPayline type="sale" sysid="1" locale="DOLLAR">
<trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
<trpAmt>61.77</trpAmt>
<trpCardInfo>
<trpcAccount>2345678901234567</trpcAccount>
<trpcAuthCode>111222</trpcAuthCode>
</trpCardInfo>
</trPayline>
</trPaylines>
</trans>
<trans type="periodClose">
<trHeader>
<date>2017-04-27T23:50:17-04:00</date>
</trHeader>
</trans>
<endTotals>
<insideSales>445938.63</insideSales>
</endTotals>
</transSet>
For the other sample input, I simply copied the text into a file authCodes.txt:
111222
111333
111444
Using both input files in the sample session:
$ ./filter-xml-trpcAuthCode.sh
ERROR: Illegal number of command line arguments!
Usage:
filter-xml-trpcAuthCode.sh XML_FILE AUTH_CODES
$ ./filter-xml-trpcAuthCode.sh sample.xml authCodes.txt
<!-- authCodeList:
111222
111333
111444
-->
<trans type="network sale" recalled="false">
<trHeader>
<termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
</trHeader>
<trPaylines>
<trPayline type="sale" sysid="1" locale="DOLLAR">
<trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
<trpAmt>61.77</trpAmt>
<trpCardInfo>
<trpcAccount>2345678901234567</trpcAccount>
<trpcAuthCode>111222</trpcAuthCode>
</trpCardInfo>
</trPayline>
</trPaylines>
</trans>
$ ./filter-xml-trpcAuthCode.sh main.xml authCodes.txt >output.txt
$
The last command redirects the output to a file output.txt which may be inspected or processed afterwards.