


Updated: 2023-01-30 19:59:50


As mentioned, the nodes you need sit at different levels of the XML, so each data item needs its own path expression. You also need to traverse two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.


Therefore, consider parsing in nested for loops. Rather than growing a data frame inside a loop, build a list of dictionaries and pass it to pandas' DataFrame() constructor outside the loop. With this approach, the dictionary keys become columns and the values become the row data.
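
As a minimal sketch of that pattern, with made-up keys and values standing in for the fields parsed from the XML (the tuples here are purely illustrative):

```python
import pandas as pd

# Hypothetical rows; in the real script each dict is filled from one XML node
rows = []
for code, units in [("US", 10), ("GB", 4)]:
    rows.append({"TerritoryCode": code, "Units": units})

# Build the data frame once, after the loop has finished
df = pd.DataFrame(rows)
print(df)
```

Each dictionary becomes one row, and the keys it carries become the column names.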


The code below uses chained find() calls, long relative paths, or short descendant paths (.//) to navigate the nested levels and retrieve the corresponding element text. Note that all parsing is relative to the looped nodes: the parent terr and the child rls objects.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("file.xml")

data = []
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):

    for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):

        inner = {}

        inner['IRC'] = rls.find('./ReleaseId/ISRC').text    
        inner['IRC2'] = rls.find('./ReleaseId/ICPN').text

        # CHILDREN
        inner['Artist'] = rls.find('WMGArtistName').text
        inner['Song'] = rls.find('WMGTitle').text

        inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text    
        inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text

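        # PARENT
        inner['TerritoryCode'] = terr.find('./TerritoryCode').text

        # Collect this row; without this step the list stays empty
        data.append(inner)

# Build the data frame once, outside the loops
df = pd.DataFrame(data)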
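
To see how the nested loops pair each release with its parent territory, here is a small self-contained sketch. The XML string is a hypothetical sample whose layout is only assumed from the paths used above (the Report root element and all values are invented); the real file may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; structure is assumed from the path expressions above
xml = """
<Report>
  <SalesToRecordCompanyByTerritory>
    <TerritoryCode>US</TerritoryCode>
    <ReleaseTransactionsToRecordCompany>
      <ReleaseId><ISRC>AAA111</ISRC></ReleaseId>
      <WMGArtistName>Artist A</WMGArtistName>
    </ReleaseTransactionsToRecordCompany>
    <ReleaseTransactionsToRecordCompany>
      <ReleaseId><ISRC>BBB222</ISRC></ReleaseId>
      <WMGArtistName>Artist B</WMGArtistName>
    </ReleaseTransactionsToRecordCompany>
  </SalesToRecordCompanyByTerritory>
  <SalesToRecordCompanyByTerritory>
    <TerritoryCode>GB</TerritoryCode>
    <ReleaseTransactionsToRecordCompany>
      <ReleaseId><ISRC>CCC333</ISRC></ReleaseId>
      <WMGArtistName>Artist C</WMGArtistName>
    </ReleaseTransactionsToRecordCompany>
  </SalesToRecordCompanyByTerritory>
</Report>
"""

root = ET.fromstring(xml)

data = []
for terr in root.findall('.//SalesToRecordCompanyByTerritory'):
    for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):
        data.append({
            'IRC': rls.find('./ReleaseId/ISRC').text,        # child, long relative path
            'Artist': rls.find('WMGArtistName').text,        # direct child
            'TerritoryCode': terr.find('./TerritoryCode').text,  # from the parent node
        })

for row in data:
    print(row)
```

The territory code is repeated on every release row beneath it, which is what flattens the two repeating levels into one table.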


You can shorten the find() chains and long relative paths with descendant searches using .//, which return the first matching element at any depth beneath the current node:

inner['IRC'] = rls.find('.//ISRC').text    
inner['IRC2'] = rls.find('.//ICPN').text

inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
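
One caveat worth sketching: find() returns None when nothing matches, so calling .text on the result raises AttributeError. If some releases might lack an element (an assumption, since the original data is not shown), ElementTree's findtext() with a default is one way to guard against that. The fragment below is a hypothetical release node with no ICPN.

```python
import xml.etree.ElementTree as ET

# Hypothetical release node that has an ISRC but no ICPN
rls = ET.fromstring(
    "<ReleaseTransactionsToRecordCompany>"
    "<ReleaseId><ISRC>AAA111</ISRC></ReleaseId>"
    "</ReleaseTransactionsToRecordCompany>"
)

# findtext() returns the element's text, or the default when no element matches
isrc = rls.findtext('.//ISRC', default=None)
icpn = rls.findtext('.//ICPN', default=None)  # missing, so the default is returned

print(isrc, icpn)
```

Missing values then show up as None in the dictionaries and as NaN or None in the resulting data frame, rather than stopping the loop.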