Spark: applying a function to columns in parallel


"values for each column could be calculated independently from other columns"

While that is true, it doesn't really help your case. You can generate a number of independent DataFrames, each with its own additions, but that doesn't mean you can automatically combine them into a single execution plan.

Each application of handleBias shuffles your data twice, and the output DataFrames don't have the same data distribution as the parent DataFrame. This is why, when you fold over the list of columns, each addition has to be performed separately.
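
To make the sequencing concrete, here is a self-contained toy reconstruction. The handleBias below is a made-up stand-in (an aggregate joined back onto the input), not your implementation; the point is only that the foldLeft plans every step on top of the previous step's output:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("fold-demo").getOrCreate()
    import spark.implicits._

    // toy input: two categorical columns and a target
    val df = Seq(("a", "x", 1.0), ("b", "x", 2.0), ("a", "y", 3.0)).toDF("c1", "c2", "target")
    val columns = Seq("c1", "c2")

    // toy stand-in for handleBias: aggregate per level, then join the result back
    def handleBias(d: DataFrame, c: String): DataFrame =
      d.join(d.groupBy(c).agg(avg("target").as(s"pre_$c")), Seq(c))

    // sequential by construction: each iteration reads the previous iteration's output
    val result = columns.foldLeft(df)((acc, c) => handleBias(acc, c))
    result.explain()   // the plan nests one handleBias layer inside the next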

Theoretically, you could design a pipeline that can be expressed (in pseudocode) like this:

  • add a unique id:

    df_with_id = df.withColumn("id", unique_id())

  • compute each df independently and convert to long format:

    dfs = for (c in columns) 
      yield handle_bias(df, c).withColumn(
        "pres", explode([(pre_name, pre_value), (pre2_name, pre2_value)])
      )
    

  • union all partial results:

    combined = dfs.reduce(union)
    

  • pivot to convert from long to wide format:

    combined.groupBy("id").pivot("pres._1").agg(first("pres._2"))
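
Sketched in runnable form (reusing spark, df and columns from the toy example above), a minimal version of the same pipeline could look like this. The perColumnStats function is a hypothetical stand-in for your handleBias, emitting two (name, value) pairs per row purely so the id / union / pivot mechanics can be shown end to end:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // 1. add a unique id so the partial results can be matched up again later
    val dfWithId = df.withColumn("id", monotonically_increasing_id())

    // hypothetical stand-in for handleBias: emit (name, value) pairs in long format
    def perColumnStats(d: DataFrame, c: String): DataFrame =
      d.select(
          col("id"),
          explode(array(
            struct(lit(s"pre_$c").as("name"), col("target").as("value")),
            struct(lit(s"pre2_$c").as("name"), (col("target") * 2).as("value"))
          )).as("pres")
        )
        .select(col("id"), col("pres.name").as("name"), col("pres.value").as("value"))

    // 2. compute each column independently, in long format
    val dfs = columns.map(c => perColumnStats(dfWithId, c))

    // 3. union all partial results
    val combined = dfs.reduce(_ union _)

    // 4. pivot from long back to wide format, one row per original id
    val wide = combined
      .groupBy("id")
      .pivot("name")
      .agg(first("value"))
      .join(dfWithId, "id")

    wide.show()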
    

But I doubt it is worth all the fuss. The process you use is extremely heavy as it is, and it requires significant network and disk IO.

If the total number of levels (sum(count(distinct x)) for x in columns) is relatively low, you can try to compute all statistics in a single pass, for example using aggregateByKey with a Map[Tuple2[_, _], StatCounter]; otherwise, consider downsampling to a level where you can compute the statistics locally.
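
For completeness, a hedged sketch of that single-pass variant, again reusing df and columns from the toy example above. Keying by (column name, level) and summarising the target column in a StatCounter is an assumption about which statistics you actually need:

    import org.apache.spark.util.StatCounter

    val targetIdx = df.columns.indexOf("target")
    val levelIdx  = columns.map(c => c -> df.columns.indexOf(c)).toMap

    // one record per (column, level) pair, all aggregated in a single pass
    val stats: Map[(String, Any), StatCounter] = df.rdd
      .flatMap { row =>
        val y = row.getDouble(targetIdx)
        columns.map(c => ((c, row.get(levelIdx(c))), y))
      }
      .aggregateByKey(new StatCounter())(
        (acc, v) => acc.merge(v),   // fold a value into the partition-local counter
        (a, b)   => a.merge(b)      // merge counters across partitions
      )
      .collectAsMap()               // small, as long as the total number of levels is low
      .toMap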