Randomly splitting a Spark DataFrame

Updated: 2023-11-18 22:26:04

TL;DR If you want to split a DataFrame, use the randomSplit method:

ratings_sdf.randomSplit([0.6, 0.2, 0.2])
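To make it clearer why a split of this kind is safe, here is a minimal plain-Python sketch of the general idea: one seeded random draw per row, with the weights turned into cumulative ranges so every row lands in exactly one split. This is an illustration only, not Spark's implementation; `random_split` and its parameters are hypothetical names invented for this example.

```python
import bisect
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to exactly one split using a single seeded draw per row."""
    total = sum(weights)
    # Cumulative upper bounds of the splits, normalized by the weight total.
    bounds = [c / total for c in accumulate(weights)]
    rng = random.Random(seed)  # seeded once, so a rerun gives the same splits
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # one draw in [0, 1) decides where this row goes
        # Index of the first bound above u; clamp to guard against float rounding.
        idx = min(bisect.bisect_right(bounds, u), len(weights) - 1)
        splits[idx].append(row)
    return splits


rows = list(range(1000))
train, valid, test = random_split(rows, [0.6, 0.2, 0.2], seed=42)
```

Because each row gets a single draw, the three lists are disjoint and together cover every input row, and the fixed seed makes the result reproducible.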

Your code is wrong on multiple levels, but two fundamental problems make it broken beyond repair:

  • Spark transformations can be evaluated an arbitrary number of times, so the functions you use should be referentially transparent and free of side effects. Your code evaluates split_sdf multiple times, and you use a stateful RNG (data_split), so the results are different each time.

    This produces the behavior you describe, where each child sees a different state of the parent RDD.

  • You don't properly initialize the RNG, so the random values you get are not independent.
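Both problems can be reproduced without Spark. The sketch below is a plain-Python imitation, not the original code (which is not shown here): `tag_rows` is a hypothetical stand-in for a lazy transformation that is rerun for every child, and copying the RNG through `pickle` is an assumed model of how the same RNG state could end up on several workers.

```python
import pickle
import random

# Problem 1: a stateful RNG inside a transformation that is re-evaluated.
rng = random.Random(0)  # shared state that advances on every call


def tag_rows(rows):
    """Hypothetical lazy transformation: recomputed each time a child needs it."""
    return [(row, rng.random()) for row in rows]


rows = list(range(1000))

# Each "child" triggers its own evaluation, so each sees different random tags.
train = [r for r, u in tag_rows(rows) if u < 0.6]
test = [r for r, u in tag_rows(rows) if u >= 0.6]

overlap = set(train) & set(test)                  # rows that ended up in both
missing = set(rows) - set(train) - set(test)      # rows that ended up in neither

# Problem 2: copies of one RNG state are not independent.
base = random.Random(0)
payload = pickle.dumps(base)  # assumed model of shipping the RNG to workers
worker_rngs = [pickle.loads(payload) for _ in range(3)]
worker_draws = [[g.random() for _ in range(5)] for g in worker_rngs]
identical_streams = worker_draws[0] == worker_draws[1] == worker_draws[2]

# One simple option (an assumption, not the only fix): a distinct seed per worker.
seeded_draws = [
    [random.Random(7 + i).random() for _ in range(5)] for i in range(3)
]
```

In this imitation the two children disagree about which rows belong where, so some rows appear in both splits and some in neither, and the three copied RNGs emit the same sequence. A built-in, seeded split avoids both issues because the assignment of a row does not depend on how many times the parent is evaluated.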