且构网

Extracting data from an XML tree into pandas/CSV with Python

Updated: 2023-01-30 19:59:50

The nodes you need sit at different levels of the XML, so the path expression differs for each data item. You also need to traverse two repeating levels: SalesToRecordCompanyByTerritory and ReleaseTransactionsToRecordCompany.

Therefore, consider parsing in nested for loops. Rather than growing a DataFrame inside the loop, build a list of dictionaries and pass it to pandas' DataFrame() constructor once, outside the loop. With this approach, the dictionary keys become the columns and the dictionary values become the row data.
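
As a minimal sketch of this pattern, the snippet below uses made-up rows in place of parsed XML values. The column names and values are illustrative assumptions, not taken from the real file.

```python
import pandas as pd

# Hypothetical rows; in the real script each dict is built inside the nested loops.
rows = []
for artist, units in [("Artist A", 10), ("Artist B", 25)]:
    rows.append({"Artist": artist, "Units": units})

# One constructor call after the loop: keys become columns, values become cells.
df = pd.DataFrame(rows)
print(df)
```

Appending to a plain list is cheap, whereas adding rows to a DataFrame on every iteration copies data each time.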

The code below uses chained find() calls, long relative paths, or short .// descendant paths to navigate down the nested levels and retrieve the corresponding element text. Note that all parsing is relative to the looped nodes: terr for the parent level and rls for the child level.

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("file.xml")

data = []
for terr in tree.findall('.//SalesToRecordCompanyByTerritory'):

    for rls in terr.findall('.//ReleaseTransactionsToRecordCompany'):

        inner = {}

        # Grandchildren of rls, reached with a relative path
        inner['IRC'] = rls.find('./ReleaseId/ISRC').text
        inner['IRC2'] = rls.find('./ReleaseId/ICPN').text

        # Direct children of rls
        inner['Artist'] = rls.find('WMGArtistName').text
        inner['Song'] = rls.find('WMGTitle').text

        # Deeper descendants: one long relative path, one chained find()
        inner['Units'] = rls.find('./SalesTransactionToRecordCompany/SalesDataToRecordCompany/GrossNumberOfConsumerSales').text
        inner['PPD'] = rls.find('Deal').find('AmountPayableInCurrencyOfAccounting').text

        # Child of the enclosing territory node (the parent loop)
        inner['TerritoryCode'] = terr.find('./TerritoryCode').text

        data.append(inner)

df = pd.DataFrame(data)
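
The loop above raises AttributeError if any find() returns None, which happens whenever an element is missing from a release. The sketch below is one way to guard against that and to write the rows straight to CSV with the standard library. The sample XML, the text_or_none helper, and the column names are assumptions made for illustration; only the element names come from the code above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the second release has no ReleaseId/ISRC.
SAMPLE = """
<Report>
  <SalesToRecordCompanyByTerritory>
    <TerritoryCode>US</TerritoryCode>
    <ReleaseTransactionsToRecordCompany>
      <ReleaseId><ISRC>ISRC-1</ISRC></ReleaseId>
      <WMGArtistName>Artist A</WMGArtistName>
    </ReleaseTransactionsToRecordCompany>
    <ReleaseTransactionsToRecordCompany>
      <WMGArtistName>Artist B</WMGArtistName>
    </ReleaseTransactionsToRecordCompany>
  </SalesToRecordCompanyByTerritory>
</Report>
""".strip()


def text_or_none(node, path):
    """Return the text of the first match for path, or None if nothing matches."""
    found = node.find(path)
    return found.text if found is not None else None


root = ET.fromstring(SAMPLE)
rows = []
for terr in root.findall(".//SalesToRecordCompanyByTerritory"):
    for rls in terr.findall(".//ReleaseTransactionsToRecordCompany"):
        rows.append({
            "ISRC": text_or_none(rls, "./ReleaseId/ISRC"),
            "Artist": text_or_none(rls, "WMGArtistName"),
            "TerritoryCode": text_or_none(terr, "./TerritoryCode"),
        })

# Write to an in-memory buffer; DictWriter emits None as an empty field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ISRC", "Artist", "TerritoryCode"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

If you already have the DataFrame, calling df.to_csv() on it is the shorter route to a CSV file.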

You can shorten the find() chains and long relative paths with descendant searches using .//, which match the named element at any depth below the current node:

inner['IRC'] = rls.find('.//ISRC').text
inner['IRC2'] = rls.find('.//ICPN').text

inner['PPD'] = rls.find('.//AmountPayableInCurrencyOfAccounting').text
inner['Units'] = rls.find('.//GrossNumberOfConsumerSales').text
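
One caveat with the .// shortcut: find() returns only the first match in document order, so if a release holds several elements with the same name, the rest are silently ignored. The sketch below, built on a hypothetical release with two sales entries, contrasts find() with findall().

```python
import xml.etree.ElementTree as ET

# Hypothetical release containing two sales transactions.
rls = ET.fromstring(
    "<ReleaseTransactionsToRecordCompany>"
    "<SalesTransactionToRecordCompany><SalesDataToRecordCompany>"
    "<GrossNumberOfConsumerSales>3</GrossNumberOfConsumerSales>"
    "</SalesDataToRecordCompany></SalesTransactionToRecordCompany>"
    "<SalesTransactionToRecordCompany><SalesDataToRecordCompany>"
    "<GrossNumberOfConsumerSales>5</GrossNumberOfConsumerSales>"
    "</SalesDataToRecordCompany></SalesTransactionToRecordCompany>"
    "</ReleaseTransactionsToRecordCompany>"
)

# find() stops at the first matching descendant.
first = rls.find(".//GrossNumberOfConsumerSales").text

# findall() collects every matching descendant.
all_units = [el.text for el in rls.findall(".//GrossNumberOfConsumerSales")]

print(first, all_units)
```

If a release can repeat an element, looping over findall() (or adding a third nested loop) keeps every value instead of just the first.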