且构网 - Sharing what programmers deal with in development

Get blocks from an XML file using data from a source file

Last updated: 2023-11-24 11:02:52

      I revamped this question since I've been reading a bit on XML.

      I have a source file that contains a list of AuthNumbers: 111222, 111333, 111444, etc.

      I need to search for the numbers in that list and find them in a corresponding XML file. In the xml file the line is formatted as such: <trpcAuthCode>111222</trpcAuthCode>

      This can be achieved quite painlessly using grep; however, I require the entire block containing the transaction.

      The block starts with: <trans type="network sale" recalled="false"> or <trans type="network sale" recalled="false" rollback="true"> and/or some other variations. Actually <trans*> would be best if something like that is possible.

      The block ends with </trans>

      It doesn't need to be elegant or efficient. I just need it to work. I suspect some transactions are dropping out and I need a quick way to vet the ones that are not being processed.

      If it helps, here is a link to the original (sanitized) xml: https://www.dropbox.com/s/cftn23tnz8uc9t8/main.xml?dl=0

      And what I would like to extract: https://www.dropbox.com/s/b2bl053nom4brkk/transaction_results.xml?dl=0

      The size of each result will vary as each transaction can vary greatly in length depending on the amount of products purchased. In the results xml you see that I extracted the xml I need based on the trpcAuthCode list 111222,111333,111444.

      Concerning XML and awk questions, you often find comments from the gurus (the ones with a k in their reputation) that XML processing in awk is complicated or insufficient. As I understood the question, the script is needed for personal and/or debugging purposes. For this, my solution should be sufficient, but please keep in mind that it will not work on every legal XML file.

      Based on your description, the sketch of the script is:

      1. If <trans*> is matched start recording.

      2. If <trpcAuthCode> is found get its contents and compare with the list. In case of match, remember block for output.

      3. If </trans> is matched stop recording. If output has been enabled print recorded block otherwise discard it.

      Because I did something similar in SO: Shell scripting - split xml into multiple files, this should not be too hard to implement.

      Though, one additional feature is necessary: feeding the AuthNumbers array into the script. Due to a surprising coincidence, I learnt the answer just this morning in SO: How to access an array in an awk, which is declared in a different awk in shell? (thanks to the comment of jas).
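
A minimal sketch of that idea, assuming the codes are available in a shell variable: `-v` assigns the string to an awk variable before `BEGIN` runs, `split()` turns it into an index-keyed array, and a loop turns the values into keys so that `in` can test membership. The names `codes`, `list`, and `set` are made up for this illustration, and the list is space-separated here only to keep the example short.

```shell
#!/bin/sh
# Hypothetical list of auth codes (space-separated for brevity).
codes="111222 111333 111444"

result=$(awk -v authCodes="$codes" 'BEGIN {
  n = split(authCodes, list, " ")        # list[1]="111222", list[2]="111333", ...
  for (i = 1; i <= n; i++) set[list[i]]  # a mere reference creates the key
  printf "%d codes\n", n
  printf "111333 in list: %d\n", ("111333" in list)  # keys are 1..n, so 0
  printf "111333 in set: %d\n", ("111333" in set)    # values are keys now, so 1
}')
printf '%s\n' "$result"
```

The `in` operator only looks at keys, which is why the lookup fails on `list` and succeeds on `set`.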

      So, putting it all together in a script filter-xml-trpcAuthCode.awk:

      BEGIN {
        record = 0 # state for recording
        buffer = "" # buffer for recording
        found = 0 # state for found auth code
        # build temp. array from authCodes which has to be pre-defined
        split(authCodes, list, "\n")
        # build final array where values become keys
        for (i in list) authCodeList[list[i]]
        # for debugging: output of authCodeList
        print "<!-- authCodeList:"
        for (authCode in authCodeList) {
          print authCode
        }
        print "-->"
      }
      
      /<trans( [^>]*)?>/ {
        record = 1 # start recording
        buffer = "" # clear buffer
        found = 0 # reset state for found auth code
      }
      
      record {
        buffer = buffer"\n"$0 # record line (if recording is enabled)
      }
      
      record && /<trpcAuthCode>/ {
        # extract auth code (gensub() is a GNU awk extension, so this needs gawk)
        authCode = gensub(/^.*>([^<]*)<\/trpcAuthCode.*$/, "\\1", "g")
        # check whether auth code in authCodeList
        found = authCode in authCodeList
      }
      
      /<\/trans>/ {
        record = 0 # stop recording
        # print buffer if auth code has been found
        if (found) {
          print buffer
        }
      }
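
`gensub()` is a GNU awk extension. Where only a POSIX awk is available, the extraction step might be sketched with `match()`, `substr()`, and `index()` instead. This is a substitute technique, not the code from the script above, and the helper name `auth_code` is invented for the example.

```shell
#!/bin/sh
# Sketch: pull the text between <trpcAuthCode> and </trpcAuthCode>
# using only POSIX awk features (no gensub).
extracted=$(printf '%s\n' '      <trpcAuthCode>111222</trpcAuthCode>' | awk '
  # auth_code() is a hypothetical helper, not part of the original script
  function auth_code(line,    open_tag, rest) {
    open_tag = "<trpcAuthCode>"
    if (!match(line, /<trpcAuthCode>[^<]*<\/trpcAuthCode>/)) return ""
    rest = substr(line, RSTART + length(open_tag))   # text after the opening tag
    return substr(rest, 1, index(rest, "<") - 1)     # up to the closing tag
  }
  /<trpcAuthCode>/ { print auth_code($0) }
')
printf '%s\n' "$extracted"
```

In the script, the `gensub()` line could then be replaced by `authCode = auth_code($0)`, assuming the helper is defined in the same file.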
      

      Notes:

      1. I struggled initially when applying split() to authCodes in BEGIN. This makes an array where the split values are stored with enumerated keys. Thus, I looked for a solution to make the values themselves keys of the array. (Otherwise, the in operator cannot be used for the search.) I found an elegant solution in the accepted answer of SO: Check if array contains value.

      2. I implemented the proposed pattern <trans*> as /<trans( [^>]*)?>/ which will even match <trans> (although <trans> seems never to occur without attributes) but not <transSet>.

      3. The
        buffer = buffer"\n"$0
        appends the current line to the previous contents. $0 contains the line without the newline character, so it has to be re-inserted. The way I did it, the buffer starts with a newline and the last line ends without one. Considering that print buffer adds a newline at the end of the text, this is fine for me. Alternatively, the above snippet could be replaced by
        buffer = buffer $0 "\n"
        or even
        buffer = (buffer != "" ? buffer"\n" : "") $0.
        (It's a matter of taste.)

      4. The filtered file is simply printed to standard output channel. It might be redirected to a file. Considering this, I formatted the additional/debug output as XML comment.

      5. If you are a little bit familiar with awk, you may notice that there isn't any next statement in my script. This is intentional: the order of rules is chosen so that a line may be processed/affected consecutively by all rules. (I tested an extreme case:
        <trans><trpcAuthCode>111222</trpcAuthCode></trans>
        and even this is processed correctly.)
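
To see how the start pattern from note 2 behaves, it can be tried against a few hand-written tag lines. The sample lines below are made up for the check; only the regular expression comes from the script.

```shell
#!/bin/sh
# Feed three sample lines to awk and report which ones the start pattern accepts.
matches=$(printf '%s\n' \
  '<trans>' \
  '<trans type="network sale" recalled="false">' \
  '<transSet periodID="1">' |
awk '{ printf "%s %s\n", (/<trans( [^>]*)?>/ ? "match" : "no-match"), $0 }')
printf '%s\n' "$matches"
```

The optional group requires a space before any attributes, and the closing `>` must follow directly otherwise, so `<transSet ...>` is rejected while `<trans>` with or without attributes is accepted.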

      To simplify testing, I added a wrapper bash script filter-xml-trpcAuthCode.sh:

      #!/usr/bin/bash
      # uncomment next line for debugging
      #set -x
      # check command line arguments
      if [[ $# -ne 2 ]]; then
        echo "ERROR: Illegal number of command line arguments!"
        echo ""
        echo "Usage:"
        echo "$(basename "$0") XML_FILE AUTH_CODES"
        exit 1
      fi
      # call awk script
      awk -v authCodes="$(cat "$2")" -f filter-xml-trpcAuthCode.awk "$1"
      

      I tested the scripts (with bash in cygwin on Windows 10) against your sample file main.xml and got four matching blocks. I was a little bit concerned about the output because in your sample output transaction_results.xml there are only three matching blocks. But checking my output visually, it seems to be appropriate. (All four hits contained a matching <trpcAuthCode> element.)

      I reduced your sample input a little bit for demonstration as sample.xml:

      <?xml version="1.0"?>
      <transSet periodID="1" periodname="Shift" longId="2017-04-27" shortId="052" site="12345">
        <trans type="periodClose">
          <trHeader>
          </trHeader>
        </trans>
        <printCashier>
          <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
        </printCashier>
        <trans type="printCashier">
          <trHeader>
            <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
            <posNum>101</posNum>
          </trHeader>
        </trans>
        <trans type="journal">
          <trHeader>
          </trHeader>
        </trans>
        <trans type="network sale" recalled="false">
          <trHeader>
            <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
          </trHeader>
          <trPaylines>
            <trPayline type="sale" sysid="1" locale="DOLLAR">
              <trpCardInfo>
                <trpcAccount>1234567890123456</trpcAccount>
                <trpcAuthCode>532524</trpcAuthCode>
              </trpCardInfo>
            </trPayline>
          </trPaylines>
        </trans>
        <trans type="network sale" recalled="false">
          <trHeader>
            <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
          </trHeader>
          <trPaylines>
            <trPayline type="sale" sysid="1" locale="DOLLAR">
              <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
              <trpAmt>61.77</trpAmt>
              <trpCardInfo>
                <trpcAccount>2345678901234567</trpcAccount>
                <trpcAuthCode>111222</trpcAuthCode>
              </trpCardInfo>
            </trPayline>
          </trPaylines>
        </trans>
        <trans type="periodClose">
          <trHeader>
            <date>2017-04-27T23:50:17-04:00</date>
          </trHeader>
        </trans>
        <endTotals>
          <insideSales>445938.63</insideSales>
        </endTotals>
      </transSet>
      

      For the other sample input I simply copied the text into a file authCodes.txt:

      111222
      111333
      111444
      

      Using both input files in the sample session:

      $ ./filter-xml-trpcAuthCode.sh
      ERROR: Illegal number of command line arguments!
      
      Usage:
      filter-xml-trpcAuthCode.sh XML_FILE AUTH_CODES
      
      $ ./filter-xml-trpcAuthCode.sh sample.xml authCodes.txt
      <!-- authCodeList:
      111222
      111333
      111444
      -->
      
        <trans type="network sale" recalled="false">
          <trHeader>
            <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
          </trHeader>
          <trPaylines>
            <trPayline type="sale" sysid="1" locale="DOLLAR">
              <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
              <trpAmt>61.77</trpAmt>
              <trpCardInfo>
                <trpcAccount>2345678901234567</trpcAccount>
                <trpcAuthCode>111222</trpcAuthCode>
              </trpCardInfo>
            </trPayline>
          </trPaylines>
        </trans>
      
      $ ./filter-xml-trpcAuthCode.sh main.xml authCodes.txt >output.txt
      
      $
      

      The last command re-directs output to a file output.txt which may be inspected or processed afterwards.
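
For completeness, a variation sometimes used for this kind of task is to let awk read the code list as its first input file instead of passing it through `-v`. The following is a sketch of that alternative, not the tested script above; the temporary file names and the tiny sample data are invented for the demonstration, and the extraction uses `sub()` rather than `gensub()`.

```shell
#!/bin/sh
# Sketch: read auth codes as the first file, then filter <trans> blocks from the second.
tmp=$(mktemp -d)
printf '111222\n111333\n' > "$tmp/codes.txt"   # hypothetical code list
cat > "$tmp/in.xml" <<'EOF'
<transSet>
<trans type="a">
<trpcAuthCode>532524</trpcAuthCode>
</trans>
<trans type="b">
<trpcAuthCode>111222</trpcAuthCode>
</trans>
</transSet>
EOF

filtered=$(awk '
  NR == FNR { codes[$0]; next }                 # first file: values become keys
  /<trans( [^>]*)?>/ { rec = 1; buf = ""; found = 0 }
  rec { buf = buf $0 "\n" }
  rec && /<trpcAuthCode>/ {
    code = $0
    sub(/^.*<trpcAuthCode>/, "", code)          # drop everything up to the opening tag
    sub(/<\/trpcAuthCode>.*$/, "", code)        # drop the closing tag and the rest
    if (code in codes) found = 1
  }
  /<\/trans>/ { if (rec && found) printf "%s", buf; rec = 0 }
' "$tmp/codes.txt" "$tmp/in.xml")
rm -r "$tmp"
printf '%s\n' "$filtered"
```

Here `NR == FNR` is true only while the first file is being read (assuming it is not empty), and `found` is only ever set within a block, so a transaction with several auth codes is kept if any one of them is in the list.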