
How to split a Spark dataframe into partitions with an equal number of records

Updated: 2023-11-18 23:31:28

In my case I needed balanced (equal-sized) partitions in order to perform a specific cross-validation experiment.

To do this, you typically:

  1. Randomize the dataset
  2. Apply a modulo operation to assign each element to a fold (partition), as sketched right after this list

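For the two steps above in their simplest, non-stratified form, a minimal sketch could look like the following (the names rdd, seed and k are placeholders introduced here for illustration, not part of the original answer):

import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("folds").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 100)   // toy input data

val seed = 42L
val k = 3   // number of folds

val withFold = rdd
  .zipWithIndex                                                          // (element, original index)
  .map { case (x, i) => (x, new scala.util.Random(seed + i).nextInt()) } // deterministic random key per element
  .sortBy(_._2)                                                          // step 1: randomize the order
  .zipWithIndex                                                          // (shuffled element, new position)
  .map { case ((x, _), pos) => (x, pos % k) }                            // step 2: fold id = position modulo k
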
After this step you will have to extract each partition using filter; AFAIK there is still no transformation that splits a single RDD into several.

Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:

val npartitions = 3

// 'data' is the input RDD of instances (e.g. RDD[Array[Double]]); 'seed' and
// 'm_classIndex' are assumed to be defined elsewhere in the surrounding code.
val foldedRDD = data
   // Pair each instance with its index
   .zipWithIndex
   // Attach a deterministic pseudo-random key derived from the index and the seed
   .map( t => (t._1, t._2, new scala.util.Random(t._2*seed).nextInt()) )
   // Order by class label, then randomly within each class (stratified shuffle)
   .sortBy( t => (t._1(m_classIndex), t._3) )
   // Assign each instance to a fold: position modulo the number of partitions
   .zipWithIndex
   .map( t => (t._1, t._2 % npartitions) )

val balancedRDDList =  
    for (f <- 0 until npartitions) 
    yield foldedRDD.filter( _._2 == f )
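
One possible way to use these folds for cross-validation (a sketch, not part of the original answer): take each fold in turn as the test set and union the remaining folds into the training set.

for (f <- 0 until npartitions) {
  val test  = balancedRDDList(f)
  val train = (0 until npartitions)
    .filter(_ != f)
    .map(balancedRDDList(_))
    .reduce(_ union _)   // RDD.union concatenates the remaining folds
  // fit and evaluate a model on (train, test) here
}

Note that each element still carries its original index, random key and fold id; project them away with map if only the instance itself is needed.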